refactor(cardano): shard, reorder, and merge the EWRAP boundary pipeline #978
Partitions the epoch-boundary EWRAP work unit — previously one monolithic
work unit that materialised O(active_accounts) in memory (rewards map,
deltas, logs, applied_rewards) — into three phase-specific work units that
each commit independently:
* EwrapPrepare: global classification (pools/dreps/proposals), MIRs,
enactment + refund visitors for non-account entities, emits EpochEndInit
seeding EpochState.end with the prepare-time globals and zeroed reward
accumulators.
* EwrapShard(i): range-scoped (first-byte prefix bucket) load of pending
rewards + accounts, runs rewards + drops visitors per account, emits
EpochEndAccumulate with the shard's reward contribution.
* EwrapFinalize: reads the accumulated EpochState.end, emits EpochWrapUp
(which transitions rolling/pparams snapshots and clears ewrap_progress).
Cross-shard handoff piggy-backs on EpochState rather than a new entity:
ewrap_progress: Option<u32> is the durable cursor and EpochState.end
accumulates across shards via the new deltas.
WorkBuffer gains EwrapShardingBoundary{shard_index, total_shards} and
EwrapFinaliseBoundary states; pop_work now takes ewrap_total_shards from
CardanoConfig (default 16). EpochEndAccumulate has an idempotency guard
keyed on ewrap_progress so shard re-execution after a crash is safe.
Detection-only crash recovery at initialize time logs a warning when
ewrap_progress is set; full block-rehydration resume is flagged as TODO.
Memory tests in tests/memory.rs verify both fjall and redb3 honour
range-scoped iter_entities with O(1) heap — the load-bearing property for
the shard design.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n ESTART

Decouple two responsibilities that were tangled in EwrapPrepareWorkUnit: the global epoch-boundary entity processing (now plain `Ewrap`) and the structural opening of the `EpochState.end` slot (now done by ESTART's `EpochTransition`).

Ewrap's `EpochEndInit` delta keeps its overwrite semantics; it now writes into a default-seeded slot rather than from None.

Also adds `prev_end` / `prev_ewrap_progress` undo fields to `EpochTransition` (serialized, like the other prev_* fields) so a rollback after restart correctly restores them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-account leg of the epoch close was named after its position in the EWRAP pipeline; AccountShard names what it actually does — apply rewards and pool/drep delegator drops over a key-range slice of the account namespace.

Also renames the related symbols (BoundaryWork::load_shard / commit_shard → load_account_shard / commit_account_shard, WorkBuffer::EwrapShardingBoundary → AccountShardingBoundary, InternalWorkUnit::EwrapShard → AccountShard). The user-facing `ewrap_total_shards` config field is intentionally preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…line

The epoch-boundary sequence is now AccountShard ×N → Ewrap → EwrapFinalize (was Ewrap → AccountShard ×N → EwrapFinalize). Per-account work settles first; the global Ewrap phase then patches the prepare-time fields onto an EpochState.end that already has its reward accumulators populated.

State machine: WorkBuffer::pop_work transitions reordered, and on_ewrap_boundary now takes ewrap_total_shards so the restart-at-boundary entry can construct AccountShardingBoundary directly. The total_shards == 0 defensive branch now skips to EwrapBoundary (global phase) instead of EwrapFinaliseBoundary.

Delta semantics:
- EpochEndInit::apply is now a PATCH — writes only the prepare-time fields (pool counts, epoch_incentives, MIR amounts, proposal refunds) and leaves the accumulator fields alone. ewrap_progress is no longer touched by this delta. Dropped the unused prev_ewrap_progress field.
- EpochEndAccumulate::apply treats ewrap_progress = None as the natural starting state for shard 0 (unwrap_or(0) as the expected cursor).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…delta

The boundary close is now a single Ewrap work unit: it runs the global visitors AND emits EpochWrapUp carrying the assembled final EndStats (prepare-time fields combined with the AccountShard-populated accumulator fields). The wrap-up visitor now constructs the final stats locally instead of routing them through a separate EpochEndInit delta.

Side-benefits: one fewer state-machine state, one fewer delta type, one fewer commit cycle. Atomicity also improves — the boundary close is now a single state-writer commit, so a crash between Ewrap and EwrapFinalize is no longer possible.

Test fixture in tests/epoch_pots/main.rs restructured to match the post-reorder pipeline: accumulator reset gates on AccountShard shard_index == 0; rewards CSV is dumped on the Ewrap arm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the boundary-pipeline reorder (AccountShard runs before Ewrap), the first epoch's AccountShard hits `EpochEndAccumulate::apply` with `entity.end == None` because Genesis bootstrapped the EpochState before ESTART's `EpochTransition` had a chance to seed the slot. Seed `end = Some(EndStats::default())` directly in Genesis to match the invariant ESTART maintains for every subsequent epoch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 Walkthrough

Splits epoch-boundary work into per-shard and global finalize phases for Ewrap, Estart, and RUPD; adds shard partitioning, persisted shard progress cursors, sharded loading/commit APIs, and a centralized lifecycle runner (initialize → per-shard loop → finalize) to enable resumable, memory-bounded boundary processing.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Sync as sync::run_lifecycle
    participant WU as WorkUnit (Ewrap/Estart/Rupd)
    participant State as StateStore (persist)
    participant Archive as ArchiveStore
    Sync->>WU: initialize(domain)
    Note over WU: determine total_shards, init globals
    loop for shard_index in 0..total_shards-1
        Sync->>WU: load(domain, shard_index)
        Note over WU: load_shard(ranges): pending rewards/snapshot for ranges
        Sync->>WU: compute(shard_index)
        Note over WU: compute_shard_deltas(): apply per-account visitors
        Sync->>WU: commit_wal(domain, shard_index) [opt]
        Sync->>WU: commit_state(domain, shard_index)
        Note over WU: commit_shard: write shard-scoped entities, apply progress delta
        WU->>State: write shard state (PendingRewardState, deltas)
        WU->>Archive: write shard archive logs
        Sync->>WU: commit_archive(domain, shard_index)
        Sync->>WU: commit_indexes(domain, shard_index)
    end
    Sync->>WU: finalize(domain)
    Note over WU: load_finalize + commit_finalize: process MIRs/refunds, write final archive, advance/clear cursors
    WU->>State: write final epoch snapshot & global entities
    WU->>Archive: write finalized archive logs
```
```mermaid
sequenceDiagram
    participant Shard as Per-shard processing
    participant Progress as EpochState Progress
    participant Global as Finalize pass
    Shard->>Shard: load_shard(ranges)
    Shard->>Shard: compute_shard_deltas() (apply rewards/drops)
    Shard->>Progress: apply(EWrapProgress / RupdProgress)
    Shard->>Shard: commit_shard (state + archive)
    Note over Shard,Progress: shard accumulators added into EpochState.end
    Global->>Global: load_finalize()
    Global->>Global: compute_global_deltas() (MIRs/refunds/wrapup)
    Global->>Progress: apply(EpochWrapUpV2 / EpochTransitionV2)
    Global->>Shard: clear or verify progress cursors
    Global->>Shard: commit_finalize (final archive + snapshot)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 5 passed
Pulls the AccountShard work unit out of `ewrap/` into a peer module `ashard/` (matching the layout of `estart/`, `rupd/`, `roll/`, `genesis/`). The shared `BoundaryWork` / `BoundaryVisitor` infrastructure and the drops visitor (used by both phases) stay in `ewrap/`; `ashard/` imports them.

Moves: `rewards.rs`, `shard.rs`, `AccountShardWorkUnit` (from `work_unit.rs`), and the `load_*` / `commit_*` impl blocks.

Visibility on shared `BoundaryWork` helpers (`new_empty`, `load_pool_data`, `load_drep_data`, `stream_and_apply_namespace`) widened from private to `pub(crate)`. The `ending_state` field also widened to `pub(crate)` so peer modules can mutate it (e.g. `wrapup.flush` already does this).

Method/identifier renames to match the new module path:
- `BoundaryWork::load_account_shard` → `load_ashard`
- `BoundaryWork::commit_account_shard` → `commit_ashard`
- `WorkUnit::name()` returns `"ashard"`

Type name `AccountShardWorkUnit` is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mments

Sweeps the docstrings/comments touched in this PR for references to phases, work units, and deltas that no longer exist after the rename / reorder / merge / split sequence:

- Restore the in-place explanation for the "rewards before drops" HACK in `ashard/loading.rs` (the dangling "see comment on the pre-shard path" pointed to a comment that was deleted when the prepare phase was removed).
- Drop "prepare phase" / "finalize phase" wording from `BoundaryWork` field docstrings, `commit_ewrap` comments, and `loading.rs` section dividers — neither phase exists; there's only Ewrap (global + close) and AccountShard (per-account).
- Update the ESTART `EpochTransition` description in `work_units.md` so it reflects the post-merge data flow: AccountShards populate the accumulators directly, then Ewrap reads them back and emits `EpochWrapUp` with the final `EndStats` (no `EpochEndInit` patch step anymore).
- Rename `compute_prepare_deltas` → `compute_ewrap_deltas`. The "prepare" name was a leftover from the `EwrapPrepare` work unit; the method is now the only Ewrap-phase compute helper.
- Tighten `load_pending_rewards_range` docstring; flag that the `None` branch is currently unused.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shard-related identifiers and comments were named after the legacy EWRAP pipeline that bundled the global epoch-boundary work and the per-account shards together. With AccountShard now a distinct work unit in its own module, those names are misleading. Rename to use the `ashard` prefix consistently with the module path:

- `CardanoConfig::ewrap_total_shards` → `ashard_total`
- `CardanoConfig::DEFAULT_EWRAP_TOTAL_SHARDS` → `DEFAULT_ASHARD_TOTAL`
- `EpochState::ewrap_progress` → `ashard_progress`
- `prev_ewrap_progress` → `prev_ashard_progress` on `EpochEndAccumulate`, `EpochWrapUp`, and `EpochTransition`
- `WorkBuffer::receive_block` / `on_ewrap_boundary` / `pop_work` parameter `ewrap_total_shards` → `ashard_total`
- Error messages in `ashard/shard.rs` updated to match.

Also fixes comment / doc misattributions where "EWRAP" was used for work that's now in `AccountShard`:

- `PendingRewardState` / `DequeueReward` are consumed by `AccountShard`, not Ewrap.
- `PendingMirState` / `DequeueMir` are consumed by Ewrap (clarified).
- `AppliedReward` and the `applied_rewards` field are populated during AccountShard, not Ewrap.
- RUPD's docstring now says rewards are consumed by `AccountShard`.
- Crash-recovery wording in `lib.rs` says "mid-boundary" instead of "mid-EWRAP" since the cursor specifically tracks AccountShard progress.

BREAKING CONFIG CHANGE: existing `dolos.toml` files that explicitly set `ewrap_total_shards` need to rename the key to `ashard_total`. Users relying on the default (omitted) are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the type and variant names with the module path convention:

- struct `AccountShardWorkUnit` → `AShardWorkUnit`
- enum variant `CardanoWorkUnit::AccountShard` → `AShard`
- enum variant `InternalWorkUnit::AccountShard` → `AShard`
- WorkBuffer state `AccountShardingBoundary` → `AShardingBoundary`
- module re-export and all callers updated to match
- prose / docstrings / log messages also use `AShard` consistently

The module path is `crate::ashard`, so the type now reads as `ashard::AShardWorkUnit`. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review feedback: the user-facing config name should be self-explanatory in `dolos.toml`. Renames everywhere for consistency:

- `CardanoConfig::ashard_total` field → `account_shards`
- `CardanoConfig::ashard_total()` accessor → `account_shards()`
- `CardanoConfig::DEFAULT_ASHARD_TOTAL` → `DEFAULT_ACCOUNT_SHARDS`
- WorkBuffer parameters and error messages updated to match.

BREAKING CONFIG CHANGE: existing `dolos.toml` files that explicitly set this option (under any prior name from this PR) need to use `account_shards`. Users relying on the default are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guards against a config change to `account_shards` corrupting an
in-flight boundary. Previously, if dolos crashed mid-boundary and the
operator changed `account_shards` between crash and restart, the resume
would re-partition the account key space with the new count, mismatching
the cursor's already-committed shards.
Fix: snapshot the boundary's shard count into state at the first
`EpochEndAccumulate` apply. The persisted total is authoritative for the
duration of the in-flight boundary; the new config value only takes
effect on the next boundary.
Changes:
- New `AShardProgress { committed, total }` struct stored at
`EpochState.ashard_progress: Option<AShardProgress>` (was
`Option<u32>`).
- `EpochEndAccumulate` carries `total_shards`. Its apply validates the
delta's `total_shards` matches any previously persisted total and
surfaces an error if they diverge (would only happen if a work unit
was constructed with a stale config view).
- `EpochWrapUp` and `EpochTransition` undo fields adapted to the new
type.
- `AShardWorkUnit::load` / `commit_state` read the persisted total when
present and fall back to `config.account_shards()` for fresh
boundaries.
- `CardanoLogic` caches `effective_account_shards` (= persisted total
if a boundary is in flight, else config). Refreshed at every
`pop_work` call so `receive_block` (which has no state access) can
use the up-to-date value when constructing
`WorkBuffer::AShardingBoundary`.
- Crash-recovery wording updated to surface a clear warning when the
persisted total disagrees with current config.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/cardano/src/ashard/rewards.rs (1)
114-188: ⚠️ Potential issue | 🟡 Minor

Log message clarity: add shard context to per-shard telemetry.

The `flush` method (lines 114–188) now runs once per AShard, but the telemetry messages ("rewards remaining before drain", "SPENDABLE REWARDS LEFT UNPROCESSED", etc.) don't indicate shard scope. Per-shard loading correctly isolates each shard's RewardMap via `load_pending_rewards_range(state, Some(range))`, so the drain and error detection are functionally sound. However, for operational visibility, these log statements should include shard index and total_shards to clarify they reflect per-shard metrics, not boundary-wide state. This is especially helpful for debugging when multiple shards report metrics in parallel.

crates/cardano/src/model/epochs.rs (1)
980-981: ⚠️ Potential issue | 🟡 Minor

Fix the `unused_doc_comments` rustdoc warning flagged by CI.

Pipeline reports `rustdoc does not generate documentation for macro invocations` on these lines. The `///` comment immediately above the `prop_compose!` macro invocation is dropped on the floor by rustdoc and triggers the lint. Convert to a regular `//` comment (or move it inside the closure body if you want it preserved).

🛠️ Proposed fix

```diff
-    /// `EpochStatsUpdate::apply` calls `rolling.live_mut` which asserts `next` is None,
-    /// so we need a specialized generator that keeps `rolling.next` empty.
+    // `EpochStatsUpdate::apply` calls `rolling.live_mut` which asserts `next` is None,
+    // so we need a specialized generator that keeps `rolling.next` empty.
     prop_compose! {
```
🧹 Nitpick comments (12)
crates/cardano/src/ashard/rewards.rs (1)
56-77: Stale `EWRAP` reference in comment.

The comment at lines 59–61 still says "registered at RUPD startStep time but deregistered before EWRAP". After this refactor the per-account registration check happens during AShard, not Ewrap. Suggest updating to "deregistered before AShard" (or "before the boundary's per-account leg") to keep the comment in lockstep with the renamed pipeline.

crates/cardano/src/ashard/loading.rs (1)
19-57: Optional: drop the `Option<Range<EntityKey>>` wrapper.

The doc-comment already states `load_ashard` is the only caller and always passes `Some(range)`. Making the parameter `Range<EntityKey>` directly removes a never-taken branch and tightens the contract. If a future "load all" caller appears, reintroducing the option is trivial.

♻️ Suggested refactor

```diff
-    fn load_pending_rewards_range<D: Domain>(
-        &mut self,
-        state: &D::State,
-        range: Option<Range<EntityKey>>,
-    ) -> Result<(), ChainError> {
-        let pending_iter = state
-            .iter_entities_typed::<PendingRewardState>(PendingRewardState::NS, range)?;
+    fn load_pending_rewards_range<D: Domain>(
+        &mut self,
+        state: &D::State,
+        range: Range<EntityKey>,
+    ) -> Result<(), ChainError> {
+        let pending_iter = state
+            .iter_entities_typed::<PendingRewardState>(PendingRewardState::NS, Some(range))?;
```

And drop the `Some(range.clone())` at the call site (line 81) to pass `range.clone()` directly.

tests/epoch_pots/main.rs (1)
500-513: Optional: collapse the two `if let Some(boundary)` checks.

The shard-index==0 reset and the per-shard extend can share a single `boundary` binding, which trims a branch and makes the relationship between "capture ending epoch" and "extend rewards" obvious.

♻️ Suggested refactor

```diff
 CardanoWorkUnit::AShard(shard) => {
-    if shard.shard_index() == 0 {
-        // First shard of this boundary — reset accumulator
-        // and capture the ending epoch.
-        accumulated_applied.clear();
-        if let Some(boundary) = shard.boundary() {
-            accumulated_ending_epoch =
-                Some(boundary.ending_state().number);
-        }
-    }
-    if let Some(boundary) = shard.boundary() {
+    if let Some(boundary) = shard.boundary() {
+        if shard.shard_index() == 0 {
+            // First shard of this boundary — reset
+            // accumulator and capture the ending epoch.
+            accumulated_applied.clear();
+            accumulated_ending_epoch =
+                Some(boundary.ending_state().number);
+        }
         accumulated_applied.extend(boundary.applied_rewards.iter().cloned());
     }
 }
```

crates/cardano/work_units.md (1)
5-11: Add a language to the fenced block (MD040).

`markdownlint-cli2` flagged this fence as missing a language. Use `text` (or any non-rendered identifier) so the diagram passes lint.

📝 Suggested fix

````diff
-```
+```text
 Estart → Roll … → Rupd → Roll … → AShard ×N → Ewrap
 (open)   (blocks)  (RUPD)  (blocks)  (per-account)  (global + close)
                                                      │
                                                      ▼
                                            next epoch's Estart
````

crates/cardano/src/lib.rs (2)

284-341: Consolidate duplicated `effective_account_shards` lookup logic.

The inline computation at lines 338–341 is identical to the new helper `read_effective_account_shards` (lines 253–258). On top of that, `load_epoch::<D>(state)` is invoked twice within `initialize` (line 292 for the warnings, line 338 to derive `effective_account_shards`). A small refactor avoids the double read and keeps the "persisted total else config" rule in one place.

♻️ Proposed refactor

Lift the helper to a free fn and reuse it (and reuse the loaded `EpochState` for the warnings):

```diff
-    /// Compute the effective `account_shards` value: stored
-    /// `ashard_progress.total` if a boundary is in flight, otherwise the
-    /// configured value.
-    fn read_effective_account_shards<D: Domain>(&self, state: &D::State) -> u32 {
-        load_epoch::<D>(state)
-            .ok()
-            .and_then(|e| e.ashard_progress.as_ref().map(|p| p.total))
-            .unwrap_or_else(|| self.config.account_shards())
-    }
+    /// Compute the effective `account_shards` value: stored
+    /// `ashard_progress.total` if a boundary is in flight, otherwise the
+    /// configured value.
+    fn read_effective_account_shards<D: Domain>(&self, state: &D::State) -> u32 {
+        effective_account_shards_from_state::<D>(state, &self.config)
+    }
+}
+
+fn effective_account_shards_from_state<D: Domain>(state: &D::State, config: &CardanoConfig) -> u32 {
+    load_epoch::<D>(state)
+        .ok()
+        .and_then(|e| e.ashard_progress.as_ref().map(|p| p.total))
+        .unwrap_or_else(|| config.account_shards())
 }
```

And in `initialize`, compute once and reuse the loaded epoch for both the warnings and the cached value.
404-408: Per-`pop_work` state read may be unnecessary outside boundaries.

`effective_account_shards` is refreshed on every `pop_work` call (each block cycle) via a `read_entity_typed` lookup, but it only matters at epoch boundaries. The cost is small (single point read), but you could skip the read when the cached value is non-stale by only refreshing when `ashard_progress` could have changed (i.e., before an actual boundary handoff). Worth considering if `pop_work` shows up in profiling.

crates/cardano/src/ewrap/commit.rs (1)

91-100: Full-table account scan to apply a handful of MIR rewards.

`stream_and_apply_namespace::<D, AccountState>(state, &writer, None)` walks every account in state to surface the small set with MIR-queued deltas. The inline comment acknowledges this is "effectively a targeted write via the streaming path", but on mainnet-sized account tables this is a significant cost paid every boundary, regardless of how few MIRs are pending (often zero).

Consider a direct path: iterate `self.applied_mir_credentials` (or the credentials of the queued `AssignRewards` deltas), point-read each `AccountState`, apply, and write. The streaming helper can stay for the full-set namespaces (Pool/DRep/Proposal/EpochState), but accounts during Ewrap are inherently sparse.
crates/cardano/src/ashard/work_unit.rs (1)

62-129: Avoid recomputing `total_shards`/`range` from state in `commit_state`.

`load` computes `total_shards` and `range` from persisted state, then calls `BoundaryWork::load_ashard(..., shard_index, total_shards, range)` — so the boundary already owns these values. `commit_state` repeats the same state read and recomputation, with two minor concerns:

1. **Code duplication**: identical 8-line block in both methods (and a third copy in `crates/cardano/src/lib.rs`).
2. **Subtle drift risk**: if `commit_state` ever observed a different `ashard_progress.total` than `load` did (e.g., a shard's apply landing between the two phases of the same work unit), the recomputed `range` would diverge from what was loaded. Today this can't happen, but cross-phase reuse of the captured values is more obviously correct.

Consider exposing the captured shard info from `BoundaryWork` (or storing the computed `range` in `AShardWorkUnit` after `load`) and have `commit_state` reuse it instead of re-deriving.

♻️ Sketch

```diff
 pub struct AShardWorkUnit {
     slot: BlockSlot,
     config: CardanoConfig,
     genesis: Arc<Genesis>,
     shard_index: u32,
-    boundary: Option<BoundaryWork>,
+    boundary: Option<BoundaryWork>,
+    /// Captured during `load` so `commit_state` reuses the exact same range.
+    range: Option<std::ops::Range<dolos_core::EntityKey>>,
 }
```

Then `load` stores `self.range = Some(range.clone())` and `commit_state` consumes it.

crates/cardano/src/work.rs (2)

276-279: Test `TEST_TOTAL_SHARDS = 1` doesn't cover multi-shard sequencing in this state machine.

These tests collapse `AShard` into the `EWrap` tag and set `TEST_TOTAL_SHARDS = 1`, so the multi-shard loop in `pop_work` (lines 226–245) — including the `next_index >= total_shards` transition — is exercised only with `total_shards = 1`.
Worth adding at least one test with `TEST_TOTAL_SHARDS > 1` that asserts the work buffer emits exactly N `AShard` units before transitioning to `EwrapBoundary` (and that `last_point_seen` / cursor invariants hold across that transition). This catches off-by-one and ordering regressions in the boundary state machine that the current tests would silently miss. Want me to draft this test?

Also applies to: 325-330

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed. In `@crates/cardano/src/work.rs` around lines 276 - 279, Add a new unit test that sets TEST_TOTAL_SHARDS > 1 (instead of the current const TEST_TOTAL_SHARDS = 1) to exercise the multi-shard sequencing in the pop_work logic: drive the WorkBuffer/state machine (invoking pop_work) and assert that it emits exactly N consecutive AShard units (N == TEST_TOTAL_SHARDS) before producing an EwrapBoundary (or EWrap/Ewrap as used in this code), and verify that last_point_seen and any cursor invariants are preserved across the transition; target the test to exercise the pop_work path that contains the next_index >= total_shards branch and use the existing helpers/setup used by the other tests so it integrates with the same state machine and coverage.
```

</details>

---

`120-143`: **`account_shards == 0` "skip-shards" fallback contradicts `validate_total_shards`.**

`crates/cardano/src/ashard/shard.rs::validate_total_shards` rejects `0` outright ("account_shards must be >= 1"), yet both `on_ewrap_boundary` and the `PreEwrapBoundary` arm of `pop_work` treat `0` as "no shards — go straight to Ewrap". If a config bug or missing validation ever hands `0` here, you'll silently skip per-account boundary processing instead of failing fast. The defensive comment ("Shouldn't happen with a valid config") tacitly acknowledges this.
Either drop the branches and let an `unreachable!()` / `assert!(account_shards >= 1)` fail loudly, or change `validate_total_shards` to accept `0` as "no shards" if that's the intended semantics. Right now the two pieces disagree.

Also applies to: 210-225

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed. In `@crates/cardano/src/work.rs` around lines 120 - 143, The code currently treats account_shards == 0 as a valid "no shards" path in on_ewrap_boundary (and similarly in the PreEwrapBoundary arm of pop_work), which contradicts validate_total_shards that requires account_shards >= 1; change the behavior to fail-fast: remove the special-case branch that returns EwrapBoundary when account_shards == 0 and instead assert or panic if account_shards < 1 (e.g., add assert!(account_shards >= 1) at the start of on_ewrap_boundary), keeping the normal Restart -> AShardingBoundary path for valid shard counts; alternatively, if the intended semantics are that 0 means "no shards", update validate_total_shards to accept 0 and document that invariant—pick one consistent approach and apply the same fix in the PreEwrapBoundary/pop_work code paths so both agree with validate_total_shards.
```

</details>
</blockquote></details>

<details>
<summary>crates/cardano/src/ashard/shard.rs (1)</summary><blockquote>

`37-60`: **`debug_assert!` makes invalid `total_shards` a runtime div-by-zero in release.**

The contract is documented and there's a `validate_total_shards` helper, but the only enforcement inside `shard_key_range` is `debug_assert!`. In release builds, `total_shards == 0` reaches `let step = PREFIX_SPACE / total_shards;` and panics with a less informative message; non-divisor values silently produce truncated `step` arithmetic.
Since this is a public function and the cost of a real assertion (or returning `Result`) is negligible compared to the I/O the caller does, prefer `assert!` (or propagate the error from `validate_total_shards`).

<details>
<summary>🛡️ Proposed change</summary>

```diff
 pub fn shard_key_range(shard_index: u32, total_shards: u32) -> Range<EntityKey> {
-    debug_assert!(validate_total_shards(total_shards).is_ok());
-    debug_assert!(shard_index < total_shards);
+    assert!(
+        validate_total_shards(total_shards).is_ok(),
+        "invalid total_shards: {total_shards}"
+    );
+    assert!(shard_index < total_shards, "shard_index {shard_index} >= total_shards {total_shards}");
```

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed. In `@crates/cardano/src/ashard/shard.rs` around lines 37 - 60, The function shard_key_range uses debug_assert! to check validate_total_shards(total_shards) and shard_index < total_shards, which leaves a division-by-zero and silent truncation in release; change the debug_assert! checks to real runtime checks (e.g., assert!(validate_total_shards(total_shards).is_ok()) and assert!(shard_index < total_shards)) or convert the function to return Result and propagate validate_total_shards() error from shard_key_range so PREFIX_SPACE / total_shards cannot divide by zero and invalid total_shards are handled deterministically; update callers if you choose the Result approach and keep references to shard_key_range, validate_total_shards, and PREFIX_SPACE when making the change.
```

</details>
</blockquote></details>

<details>
<summary>crates/cardano/src/ewrap/loading.rs (1)</summary><blockquote>

`280-285`: **`shard_applied_*` fields are dead state on Ewrap-constructed `BoundaryWork`.**

`new_empty` zero-initializes `shard_applied_effective`, `shard_applied_unspendable_to_treasury`, and `shard_applied_unspendable_to_reserves`, but Ewrap never reads/writes them — the per-shard accumulators land in `EpochState.end` via `EpochEndAccumulate` and are read from `ending_state.end` by `wrapup.flush`. The fields exist only because `BoundaryWork` is shared with the AShard path.

Not a bug, but the dual-use API obscures intent. If a future change adds a third caller, it's easy to misuse these fields. Consider either splitting `BoundaryWork` into `AShardWork`/`EwrapWork` or documenting on the struct which fields each phase owns.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed. In `@crates/cardano/src/ewrap/loading.rs` around lines 280 - 285, BoundaryWork's shard_applied_effective / shard_applied_unspendable_to_treasury / shard_applied_unspendable_to_reserves are dead for the Ewrap path (zeroed in new_empty and never read by Ewrap; real per-shard accumulators live in EpochState.end via EpochEndAccumulate and are consumed by wrapup.flush), so fix by either splitting the dual-purpose struct into two clear types (e.g., AShardWork and EwrapWork) and move the AShard-only shard_applied_* fields into AShardWork, or add explicit documentation/comments on BoundaryWork and those three fields noting they are owned/used only by the AShard path; update new_empty, any constructors, and usages of BoundaryWork (including references in EpochEndAccumulate and wrapup.flush) to use the new types or to reflect the documented ownership to prevent future misuse.
```

</details>
</blockquote></details>
</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/cardano/src/model/epochs.rs`:
- Around line 737-740: The apply path currently does
entity.end.as_mut().expect("ESTART seeded EpochState.end before shards run")
which panics if EpochState.end is None; instead ensure end is initialized to
EndStats::default() when missing (so subsequent EpochWrapUp can overwrite it).
Update the apply code in crates/cardano/src/model/epochs.rs to replace the
expect with a safe initialization (e.g., use get_or_insert_with / get_or_insert
or similar) so entity.end becomes Some(EndStats::default()) if it was None
before mutating it.
- Around line 684-771: EpochEndAccumulate::undo currently assumes apply always
mutated state, causing underflow/corruption when apply returned early; fix by
recording in apply whether it performed mutations and the prior values needed to
revert (capture previous end.effective_rewards, end.unspendable_to_treasury,
end.unspendable_to_reserves and previous entity.ashard_progress/AShardProgress)
into fields on EpochEndAccumulate (or a was_applied flag plus prev_* fields),
then make undo a no-op if apply was skipped or restore those captured prev_*
values and previous ashard_progress instead of unconditionally subtracting and
setting committed/total; update apply (function EpochEndAccumulate::apply) to
populate these fields only when it actually mutates state and add a proptest
exercising apply→undo when ashard_progress is already advanced past
completed_shard_index to lock the behavior.
In `@crates/core/src/config.rs`:
- Around line 988-1002: Add runtime validation to reject invalid account_shards
values by calling the existing validate_total_shards() from
crates/cardano/src/ashard/shard.rs during config initialization: when
constructing/using CardanoConfig (e.g., inside CardanoLogic::initialize() before
any use of CardanoConfig::account_shards() or DEFAULT_ACCOUNT_SHARDS), call
validate_total_shards(self.account_shards.unwrap_or(Self::DEFAULT_ACCOUNT_SHARDS))
and return/propagate an error if it fails; ensure the initialization path stops
the startup with a clear error rather than allowing invalid values (0 or
non-divisors of 256) to be used.
In `@skills/debug-epoch-mismatch/SKILL.md`:
- Line 138: The documentation still treats reward application as a single EWRAP
work unit but the code split moves per-account reward application into AShard
while Ewrap now handles MIR/global closure; update the tables that list or
reference EWRAP (notably the "Classify the Mismatch", "Dolos Work Units →
Haskell Concepts", and the instrumentation hypothesis table) to distinguish
AShard vs Ewrap: point applyRUpd + reward filtering, unspendable routing and
registration filtering to AShard and point MIR and boundary close to Ewrap, and
adjust any bisection/debugging guidance so it routes to AShard for per-account
reward issues and to Ewrap for MIR/global closure issues. Ensure references
include the file/class names AShard, Ewrap and the function/concept names
applyRUpd, MIR, reward filtering, boundary close to make locating code
straightforward.
Outside diff comments:
In `@crates/cardano/src/ashard/rewards.rs`:
- Around line 114-188: The per-shard logs in flush don't indicate which shard
they belong to; update the flush logging to include shard context (shard_index
and total_shards). Add shard_index: usize and total_shards: usize as fields on
the AShard struct (or otherwise make them available on self or ctx), ensure any
AShard construction/population passes these values, and then include
%shard_index and %total_shards in the tracing::debug! and tracing::error!
invocations (and the debug! at the end) around ctx.rewards/ drain_unspendable/
applied_reward_credentials so all per-shard telemetry shows shard scope.
In `@crates/cardano/src/model/epochs.rs`:
- Around line 980-981: The rustdoc warning comes from using a /// doc comment
immediately above the prop_compose! macro (related to EpochStatsUpdate::apply
and rolling.live_mut which asserts rolling.next is None), so replace that
leading /// comment with a non-doc comment (//) or move the explanation into the
generator body/closure so rustdoc won't try to document the macro invocation;
specifically update the comment above the prop_compose! invocation that
references rolling.live_mut and rolling.next to use // (or relocate it inside
the prop_compose! closure) to silence the unused_doc_comments lint.
Nitpick comments:
In `@crates/cardano/src/ashard/loading.rs`:
- Around line 19-57: The function load_pending_rewards_range currently accepts
Option<Range> but the only caller (load_ashard) always passes
Some(range), so tighten the signature to take Range directly: change
the parameter type in load_pending_rewards_range and update callers (e.g., in
load_ashard remove Some(range.clone()) and pass range.clone() directly); ensure
the body drops any pattern-matching for None (no-op) and uses the range value
when calling
state.iter_entities_typed::(PendingRewardState::NS, range).
In `@crates/cardano/src/ashard/rewards.rs`:
- Around line 56-77: Update the stale comment referencing "EWRAP" to reflect the
new pipeline stage: change the text around the account deregistration
explanation to say "deregistered before AShard" (or "before the boundary's
per-account leg") so it matches where the per-account registration check now
runs; the relevant area to edit is the comment above the account.is_registered()
branch in the function that calls reward.total_value() and
ctx.rewards.return_reward_to_treasury(total). Ensure the updated comment keeps
the existing clarification about pre-filtered accounts and returned_rewards
behavior.
In `@crates/cardano/src/ashard/shard.rs`:
- Around line 37-60: The function shard_key_range uses debug_assert! to check
validate_total_shards(total_shards) and shard_index < total_shards, which leaves
a division-by-zero and silent truncation in release; change the debug_assert!
checks to real runtime checks (e.g.,
assert!(validate_total_shards(total_shards).is_ok()) and assert!(shard_index <
total_shards)) or convert the function to return Result and propagate
validate_total_shards() error from shard_key_range so PREFIX_SPACE /
total_shards cannot divide by zero and invalid total_shards are handled
deterministically; update callers if you choose the Result approach and keep
references to shard_key_range, validate_total_shards, and PREFIX_SPACE when
making the change.
In `@crates/cardano/src/ashard/work_unit.rs`:
- Around line 62-129: The load/commit_state duplication and drift risk can be
fixed by capturing the computed total_shards and range during load and reusing
them in commit_state instead of re-reading state: after computing total_shards
and range in load (before calling BoundaryWork::load_ashard), store the values
on the work unit (e.g., add fields self.total_shards and/or self.range and
assign them there), and then change commit_state to take the boundary via
self.boundary.as_mut() and use the stored range/total_shards when calling
boundary.commit_ashard::(...) rather than recomputing via load_epoch; update
BoundaryWork usage only to consume the already-stored range if needed.
In `@crates/cardano/src/ewrap/commit.rs`:
- Around line 91-100: The current call to stream_and_apply_namespace::<D,
AccountState>(state, &writer, None) performs a full-table scan of AccountState
each boundary; instead implement a targeted path that iterates over
self.applied_mir_credentials (or the credentials from queued AssignRewards
deltas), point-reads each AccountState, applies the MIR delta, and writes the
result via the same writer so only affected accounts are touched; keep
stream_and_apply_namespace for full-set namespaces like
Pool/DRep/Proposal/EpochState, and ensure the new code reuses the existing
apply/serialize logic used by stream_and_apply_namespace to maintain invariants
and error handling.
In `@crates/cardano/src/ewrap/loading.rs`:
- Around line 280-285: BoundaryWork's shard_applied_effective /
shard_applied_unspendable_to_treasury / shard_applied_unspendable_to_reserves
are dead for the Ewrap path (zeroed in new_empty and never read by Ewrap; real
per-shard accumulators live in EpochState.end via EpochEndAccumulate and are
consumed by wrapup.flush), so fix by either splitting the dual-purpose struct
into two clear types (e.g., AShardWork and EwrapWork) and move the AShard-only
shard_applied_* fields into AShardWork, or add explicit documentation/comments
on BoundaryWork and those three fields noting they are owned/used only by the
AShard path; update new_empty, any constructors, and usages of BoundaryWork
(including references in EpochEndAccumulate and wrapup.flush) to use the new
types or to reflect the documented ownership to prevent future misuse.
In `@crates/cardano/src/lib.rs`:
- Around line 284-341: Extract and reuse the epoch load and the
effective-account-shards logic: in initialize(), call load_epoch::(state)
once and store the resulting EpochState (if Ok) for the warnings; replace the
inline computation of effective_account_shards with a call to the existing
helper read_effective_account_shards (or lift that helper to a free fn if
necessary) so the "persisted total else config" rule is implemented in one
place; update references to use the single loaded epoch value for the
tracing::warn! blocks and derive effective_account_shards from
read_effective_account_shards(state, &config) (or equivalent) to remove the
duplicate load_epoch::(state) call.
- Around line 404-408: The code refreshes self.effective_account_shards on every
pop_work by calling read_effective_account_shards::(domain.state()), but this
value only matters when the epoch boundary handoff
(EpochState.ashard_progress.total) can change; avoid the repeated read by first
checking whether ashard_progress has changed and only calling
read_effective_account_shards when it has. Add a small cached marker on the
struct (e.g., last_ashard_total or last_ashard_progress) and in pop_work compare
domain.state().ashard_progress.total (or the appropriate accessor) to that
cached value; if different, call read_effective_account_shards and update both
effective_account_shards and the cached marker, otherwise skip the read and keep
the existing effective_account_shards.
In `@crates/cardano/src/work.rs`:
- Around line 276-279: Add a new unit test that sets TEST_TOTAL_SHARDS > 1
(instead of the current const TEST_TOTAL_SHARDS = 1) to exercise the multi-shard
sequencing in the pop_work logic: drive the WorkBuffer/state machine (invoking
pop_work) and assert that it emits exactly N consecutive AShard units (N ==
TEST_TOTAL_SHARDS) before producing an EwrapBoundary (or EWrap/Ewrap as used in
this code), and verify that last_point_seen and any cursor invariants are
preserved across the transition; target the test to exercise the pop_work path
that contains the next_index >= total_shards branch and use the existing
helpers/setup used by the other tests so it integrates with the same state
machine and coverage.
- Around line 120-143: The code currently treats account_shards == 0 as a valid
"no shards" path in on_ewrap_boundary (and similarly in the PreEwrapBoundary arm
of pop_work), which contradicts validate_total_shards that requires
account_shards >= 1; change the behavior to fail-fast: remove the special-case
branch that returns EwrapBoundary when account_shards == 0 and instead assert or
panic if account_shards < 1 (e.g., add assert!(account_shards >= 1) at the start
of on_ewrap_boundary), keeping the normal Restart -> AShardingBoundary path for
valid shard counts; alternatively, if the intended semantics are that 0 means
"no shards", update validate_total_shards to accept 0 and document that
invariant—pick one consistent approach and apply the same fix in the
PreEwrapBoundary/pop_work code paths so both agree with validate_total_shards.
In `@crates/cardano/work_units.md`:
- Around line 5-11: The fenced code block containing the epoch diagram starting
with "Estart → Roll … → Rupd → Roll … → AShard ×N → Ewrap" is missing
a language tag (MD040); fix it by adding a language identifier (use "text")
immediately after the opening triple backticks so the block becomes ```text and
the diagram passes markdownlint-cli2 validation.
In `@tests/epoch_pots/main.rs`:
- Around line 500-513: Collapse the two separate `if let Some(boundary)` checks
into a single binding for `boundary` when matching
`CardanoWorkUnit::AShard(shard)`: inside one `if let Some(boundary) = shard.boundary()` block first check `if shard.shard_index() == 0` to clear
`accumulated_applied` and set `accumulated_ending_epoch = Some(boundary.ending_state().number)`, then unconditionally call
`accumulated_applied.extend(boundary.applied_rewards.iter().cloned())`; this
removes the duplicated boundary lookup and makes the reset-and-extend logic
contiguous.
</details>

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: defaults
**Review profile**: CHILL
**Plan**: Pro
**Run ID**: `13b48d06-fb3d-459b-b0a0-59ac2b808397`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 885094deed721abaa8bca14087703bd7ea409c28 and 8174d2a832bd81baa5228d8b3059b0b2e0bfd7ba.

</details>

<details>
<summary>📒 Files selected for processing (23)</summary>

* `crates/cardano/src/ashard/commit.rs`
* `crates/cardano/src/ashard/loading.rs`
* `crates/cardano/src/ashard/mod.rs`
* `crates/cardano/src/ashard/rewards.rs`
* `crates/cardano/src/ashard/shard.rs`
* `crates/cardano/src/ashard/work_unit.rs`
* `crates/cardano/src/ewrap/commit.rs`
* `crates/cardano/src/ewrap/loading.rs`
* `crates/cardano/src/ewrap/mod.rs`
* `crates/cardano/src/ewrap/work_unit.rs`
* `crates/cardano/src/ewrap/wrapup.rs`
* `crates/cardano/src/genesis/mod.rs`
* `crates/cardano/src/lib.rs`
* `crates/cardano/src/model/epochs.rs`
* `crates/cardano/src/model/mod.rs`
* `crates/cardano/src/model/pending.rs`
* `crates/cardano/src/rupd/work_unit.rs`
* `crates/cardano/src/work.rs`
* `crates/cardano/work_units.md`
* `crates/core/src/config.rs`
* `skills/debug-epoch-mismatch/SKILL.md`
* `tests/epoch_pots/main.rs`
* `tests/memory.rs`

</details>

</details>
apply has three early-return guards (idempotent repeat, out-of-order, total_shards mismatch) that leave state untouched. undo unconditionally subtracted the deltas and overwrote ashard_progress, so a rollback following a skipped apply would underflow the u64 end.* fields and clobber the cursor. Capture prev_ashard_progress and set an applied flag during apply only when state is actually mutated; undo early-returns when !applied and restores from the snapshot. Same pattern as EpochWrapUp/EpochStatsUpdate. Also broaden any_epoch_state to vary ashard_progress so the existing roundtrip proptests for EpochWrapUp and EpochTransition exercise the Some(_) → None → Some(_) path their apply/undo introduced earlier in this branch (previously only None → None was covered). Add a dedicated epoch_end_accumulate_roundtrip proptest covering all four progress shapes and all three skip branches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
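As a sanity check on the pattern described above, here is a minimal self-contained sketch of the applied-flag plus snapshot approach. The type and field names are simplified stand-ins for the real `EpochEndAccumulate` / `EpochState`, and only one of the three guards is shown:

```rust
// Hypothetical, simplified sketch — not the PR's actual types.
#[derive(Default)]
struct EpochEndAccumulateSketch {
    reward_delta: u64,
    completed_shard_index: u32,
    total_shards: u32,
    // Captured by `apply` only when it actually mutates state.
    applied: bool,
    prev_progress: Option<(u32, u32)>, // (committed, total)
}

struct EpochStateSketch {
    effective_rewards: u64,
    progress: Option<(u32, u32)>,
}

impl EpochEndAccumulateSketch {
    fn apply(&mut self, state: &mut EpochStateSketch) {
        // Guard: idempotent repeat — this shard already committed.
        if let Some((committed, _)) = state.progress {
            if self.completed_shard_index < committed {
                return; // `applied` stays false, so undo becomes a no-op
            }
        }
        // (out-of-order and total-mismatch guards elided)

        // Snapshot *before* mutating so undo can restore exactly.
        self.prev_progress = state.progress;
        state.effective_rewards += self.reward_delta;
        state.progress = Some((self.completed_shard_index + 1, self.total_shards));
        self.applied = true;
    }

    fn undo(&self, state: &mut EpochStateSketch) {
        if !self.applied {
            return; // apply was skipped; nothing to revert
        }
        state.effective_rewards -= self.reward_delta;
        state.progress = self.prev_progress;
    }
}
```

The key property is that `undo` is exactly as conditional as `apply`: a rollback after a skipped apply leaves the accumulators and cursor untouched instead of underflowing them.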
The divides-256 invariant on account_shards is enforced only via debug_assert! in shard_key_range(), which is stripped in release builds. An invalid TOML value (0, 3, 7, 100, ...) would deserialize cleanly and silently corrupt key-range coverage. Call validate_total_shards() at the top of CardanoLogic::initialize and surface failures as ChainError::InvalidConfig so misconfiguration fails the startup with a clear message. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
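A minimal sketch of what such a validator can look like, assuming the divides-256 invariant described above (`ConfigError` is an illustrative stand-in for the real `ChainError::InvalidConfig`):

```rust
// Hypothetical sketch; the real validate_total_shards lives in
// crates/cardano/src/ashard/shard.rs and reports via ChainError.
const PREFIX_SPACE: u32 = 256;

#[derive(Debug, PartialEq)]
enum ConfigError {
    InvalidConfig(String),
}

/// `account_shards` must be >= 1 and divide the 256-value key-prefix
/// space evenly, otherwise the shard key ranges would not tile the
/// keyspace (some prefixes would be scanned twice or not at all).
fn validate_total_shards(total_shards: u32) -> Result<(), ConfigError> {
    if total_shards == 0 || PREFIX_SPACE % total_shards != 0 {
        return Err(ConfigError::InvalidConfig(format!(
            "account_shards must be >= 1 and divide 256, got {total_shards}"
        )));
    }
    Ok(())
}
```

Calling this once at initialize time converts a silent release-build corruption into a clear startup failure.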
After this branch's restructure, per-account reward application, unspendable routing, and EWRAP-time registration filtering live in AShard; only MIRs, refunds, and boundary close remain in Ewrap. Update the Classify, Work Units, Source Files, and Instrumentation tables so the bisection workflow points at the right work unit (and module path) for each failure shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🧹 Nitpick comments (1)
crates/cardano/src/lib.rs (1)
250-258: Minor: deduplicate effective-shard-count computation. The body of
`read_effective_account_shards` (lines 253-258) is identical to the inline block at lines 345-348 in `initialize`. Since `Self` isn't yet constructed there, the method form can't be reused, but extracting a free helper (or `Self::compute_effective_account_shards<D>(state, &config)`) would keep both call sites in sync if the fallback rule ever changes.

♻️ Suggested refactor
```diff
-    /// Compute the effective `account_shards` value: stored
-    /// `ashard_progress.total` if a boundary is in flight, otherwise the
-    /// configured value.
-    fn read_effective_account_shards<D: Domain>(&self, state: &D::State) -> u32 {
-        load_epoch::<D>(state)
-            .ok()
-            .and_then(|e| e.ashard_progress.as_ref().map(|p| p.total))
-            .unwrap_or_else(|| self.config.account_shards())
-    }
+    /// Compute the effective `account_shards` value: stored
+    /// `ashard_progress.total` if a boundary is in flight, otherwise the
+    /// configured value.
+    fn compute_effective_account_shards<D: Domain>(state: &D::State, config: &CardanoConfig) -> u32 {
+        load_epoch::<D>(state)
+            .ok()
+            .and_then(|e| e.ashard_progress.as_ref().map(|p| p.total))
+            .unwrap_or_else(|| config.account_shards())
+    }
+
+    fn read_effective_account_shards<D: Domain>(&self, state: &D::State) -> u32 {
+        Self::compute_effective_account_shards::<D>(state, &self.config)
+    }
```

Then at lines 345-348:

```diff
-        let effective_account_shards = load_epoch::<D>(state)
-            .ok()
-            .and_then(|e| e.ashard_progress.as_ref().map(|p| p.total))
-            .unwrap_or_else(|| config.account_shards());
+        let effective_account_shards =
+            Self::compute_effective_account_shards::<D>(state, &config);
```

Also applies to: 343-348
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/cardano/src/lib.rs` around lines 250 - 258, The effective-shard-count computation is duplicated between the method read_effective_account_shards and the inline block in initialize; extract the logic into a single helper (either a free function like compute_effective_account_shards::<D>(state, &config) or an associated fn Self::compute_effective_account_shards::<D>(state, &config)) that returns the u32 by loading the epoch and falling back to config.account_shards(), then replace both the body of read_effective_account_shards and the inline compute in initialize to call that helper so both call sites stay in sync if the fallback changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@crates/cardano/src/lib.rs`:
- Around line 250-258: The effective-shard-count computation is duplicated
between the method read_effective_account_shards and the inline block in
initialize; extract the logic into a single helper (either a free function like
compute_effective_account_shards::<D>(state, &config) or an associated fn
Self::compute_effective_account_shards::<D>(state, &config)) that returns the
u32 by loading the epoch and falling back to config.account_shards(), then
replace both the body of read_effective_account_shards and the inline compute in
initialize to call that helper so both call sites stay in sync if the fallback
changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3543b5af-c9cb-43d0-942e-a2c733a526c3
📒 Files selected for processing (4)
* `crates/cardano/src/lib.rs`
* `crates/cardano/src/model/epochs.rs`
* `crates/core/src/lib.rs`
* `skills/debug-epoch-mismatch/SKILL.md`
✅ Files skipped from review due to trivial changes (1)
- crates/core/src/lib.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- skills/debug-epoch-mismatch/SKILL.md
Wraps every phase (load/compute/commit_*) of `execute_work_unit` in sync and import with an `RssProbe` that emits an `info!` event with `phase`, `rss_before_mb`, `rss_after_mb`, `rss_delta_mb`. Attaches via the surrounding `#[instrument(name = "work_unit")]` span so the work unit's name is included automatically. Helps localize boundary memory spikes (RUPD/AShard/Ewrap/ESTART) without per-crate boilerplate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
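A minimal sketch of how such a probe can be structured. This is an illustrative stand-in, not the PR's `RssProbe`: it reads `VmRSS` from `/proc/self/status` (Linux-only, `None` elsewhere) and uses `println!` where the real code would emit `tracing::info!` inside the surrounding `work_unit` span:

```rust
// Hypothetical sketch of an RSS probe emitting before/after/delta per phase.
fn rss_mb() -> Option<u64> {
    // Linux-only: parse "VmRSS:  123456 kB" from the process status file.
    let status = std::fs::read_to_string("/proc/self/status").ok()?;
    let line = status.lines().find(|l| l.starts_with("VmRSS:"))?;
    let kb: u64 = line.split_whitespace().nth(1)?.parse().ok()?;
    Some(kb / 1024)
}

struct RssProbe {
    phase: &'static str,
    rss_before_mb: Option<u64>,
}

impl RssProbe {
    fn start(phase: &'static str) -> Self {
        Self { phase, rss_before_mb: rss_mb() }
    }
}

impl Drop for RssProbe {
    fn drop(&mut self) {
        if let (Some(before), Some(after)) = (self.rss_before_mb, rss_mb()) {
            // Stand-in for a `tracing::info!` event; in the real code this
            // attaches to the enclosing #[instrument(name = "work_unit")] span.
            println!(
                "phase={} rss_before_mb={} rss_after_mb={} rss_delta_mb={}",
                self.phase, before, after, after as i64 - before as i64
            );
        }
    }
}
```

The guard shape (`let _probe = RssProbe::start("load");`) keeps the instrumentation one line per phase, which is what makes it cheap to wrap every load/compute/commit step.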
Account/PendingReward keys are 32-byte CBOR encodings of `StakeCredential` whose first 4 bytes (`0x82 <variant> 0x58 0x1c`) carry no entropy across credentials. Sharding by `key[0]` therefore funnelled every credential into a single bucket regardless of `account_shards`, so only one shard ever did real work. `shard_key_range` becomes `shard_key_ranges` and returns one range per `StakeCredential` variant, sliced on `key[4]` (first byte of the actual hash). Each AShard now scans two contiguous ranges that together cover ~1/N of the credential keyspace, giving even per-shard work without any data migration. The CBOR layout invariant is asserted in a unit test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
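To make the layout concrete, here is a simplified sketch of a `shard_key_ranges`-style helper over raw byte keys. The `0x00`/`0x01` variant tags and the `Vec<u8>` key type are assumptions for illustration (the real code works over `EntityKey` and the exact CBOR variant encoding of `StakeCredential`):

```rust
// Hypothetical sketch: one key range per credential variant, sliced on
// key[4] — the first byte of the 28-byte hash — per the CBOR layout
// 0x82 <variant> 0x58 0x1c <hash>. Variant tags 0x00/0x01 are assumed.
use std::ops::Range;

const PREFIX_SPACE: u32 = 256; // possible values of key[4]
const VARIANTS: [u8; 2] = [0x00, 0x01];

fn shard_key_ranges(shard_index: u32, total_shards: u32) -> Vec<Range<Vec<u8>>> {
    assert!(total_shards >= 1 && PREFIX_SPACE % total_shards == 0);
    assert!(shard_index < total_shards);
    let step = PREFIX_SPACE / total_shards;
    let lo = (shard_index * step) as u8;
    VARIANTS
        .iter()
        .map(|&variant| {
            let start = vec![0x82, variant, 0x58, 0x1c, lo];
            let end = if shard_index + 1 == total_shards {
                // Exclusive end for the last bucket: any prefix strictly
                // greater than 0x1c sorts after every real key.
                vec![0x82, variant, 0x58, 0x1d]
            } else {
                vec![0x82, variant, 0x58, 0x1c, lo + step as u8]
            };
            start..end
        })
        .collect()
}
```

Each shard thus scans two contiguous ranges (one per variant) that together cover step/256 of the hash space, which is what restores even per-shard work.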
Hoist the credential-keyed shard helpers from `ashard/shard.rs` to a top-level `crate::shard` module and apply the same per-shard pattern to ESTART, so per-account snapshot rotations stream through bounded-memory shards instead of accumulating millions of `AccountTransition` deltas in one Vec.

New work unit `EStartShardWorkUnit` lives in a sibling `estart_shard/` module that adds shard-aware load/commit methods to `WorkContext`, mirroring the `ashard/` ↔ `ewrap/` relationship. The existing `EstartWorkUnit` is repurposed as the finalize half: pool / drep / proposal transitions, the closing `EpochTransition` (epoch advance + new pots + era migration), archive logs, and the cursor advance (which only ever moves here — never per shard).

Boundary pipeline is now: Blocks → AShard×N → Ewrap → EStartShard×N → Estart(finalize) → Blocks

Same `CardanoConfig::account_shards` drives both halves. Progress is tracked separately on `EpochState.estart_shard_progress` (parallel to `ashard_progress`); only one of the two is ever populated at once. `EStartShardAccumulate` mirrors `EpochEndAccumulate`'s idempotency, ordering, and total-mismatch guards. `EpochTransition` snapshots and clears both progress fields.

The crash-recovery warning at startup covers both halves; full mid-EStart-shard resume remains the same TODO posture as the existing AShard pipeline (`AccountTransition` is not natively idempotent on re-apply). `AShardProgress` is renamed to `ShardProgress` since it now serves two phases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update sequence diagram to show the open-half pipeline (EStartShard ×N → Estart) alongside the close-half (AShard ×N → Ewrap), add a new section for `EStartShardWorkUnit`, and trim Estart's section to its finalize-only delta set (drops `AccountTransition`, calls out that `EpochTransition` now clears both `ashard_progress` and `estart_shard_progress`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reshape the epoch-boundary pipeline so each half is one self-contained work unit instead of four:

- `WorkUnit` trait gains `total_shards()`, `initialize()`, `finalize()`, and a `shard_index: u32` parameter on the per-phase methods. The core executor (`crates/core/src/sync.rs::run_lifecycle`) loops the load/compute/commit cycle once per shard between `initialize` and `finalize`. Default `total_shards = 1` keeps non-sharded work units untouched.
- `EwrapWorkUnit` (close half, formerly `AShardWorkUnit`) and `EstartWorkUnit` (open half, formerly `EStartShardWorkUnit`) absorb their respective global passes via `finalize()`. The pre-existing `Ewrap`/`Estart` work units are removed; `CardanoWorkUnit` shrinks from 7 variants to 5 and `WorkBuffer` from 12 states to 10. Stop-epoch logic moves into `EstartBoundary`'s transition (cursor still advances only there).
- All code that runs as part of a single work unit lives under one module: `ashard/` + `ewrap/` collapsed into `crates/cardano/src/ewrap/`, `estart_shard/` + `estart/` collapsed into `crates/cardano/src/estart/`. AVVM reclamation hoisted out of the per-shard `load` into `initialize`.
- Persisted state field names (`ashard_progress`, `estart_shard_progress`) and the Serde-tagged `EStartShardAccumulate` delta are preserved to avoid an on-disk migration. The `WorkBuffer` no longer enumerates shards; `CardanoLogic` drops the `effective_account_shards` cache. Crash recovery still relies on the existing `committed` guards in `EpochEndAccumulate` / `EStartShardAccumulate` — same correctness posture as before.

Test harness gains a post-finalize callback (fires once with `shard_index == total_shards`) so per-boundary introspection (e.g. `epoch_pots`) sees the global teardown state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
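The trait extension and executor loop described above can be sketched roughly like this. Signatures are simplified for illustration: the real methods take domain/state arguments and return `Result`s, and `run_lifecycle` lives in `crates/core/src/sync.rs`:

```rust
// Minimal sketch of the shard-aware work-unit lifecycle (illustrative
// signatures, not the crate's actual trait).
trait WorkUnit {
    /// Non-sharded work units keep the single-pass behaviour.
    fn total_shards(&self) -> u32 {
        1
    }
    fn initialize(&mut self) {}
    fn load(&mut self, shard_index: u32);
    fn compute(&mut self, shard_index: u32);
    fn commit(&mut self, shard_index: u32);
    fn finalize(&mut self) {}
}

/// Executor sketch: one load/compute/commit cycle per shard, bracketed
/// by `initialize` and `finalize` so global passes run exactly once.
fn run_lifecycle(unit: &mut dyn WorkUnit) {
    unit.initialize();
    for shard_index in 0..unit.total_shards() {
        unit.load(shard_index);
        unit.compute(shard_index);
        unit.commit(shard_index);
    }
    unit.finalize();
}
```

With this shape, the shard loop becomes the executor's concern; a work unit only declares how many shards it wants and handles one shard per cycle.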
…dary nomenclature

Rename the per-shard progress deltas and the persisted EpochState progress fields to match the post-merge work-unit identities (Ewrap / Estart):

- `EpochEndAccumulate` → `EWrapProgress` (carries shard reward accumulators + cursor)
- `EStartShardAccumulate` → `EStartProgress` (cursor only — per-account Estart work lands directly on AccountState)
- `EpochState.ashard_progress` → `EpochState.ewrap_progress`
- `EpochState.estart_shard_progress` → `EpochState.estart_progress`
- `EWrapProgress.prev_ashard_progress` → `prev_ewrap_progress`
- `EStartProgress.prev_estart_shard_progress` → `prev_estart_progress`

CBOR compatibility preserved: the minicbor `#[n(15)]` / `#[n(16)]` positional indices on the EpochState fields are unchanged, so existing on-disk state deserializes unmodified. Field-name changes only affect serde-tagged paths (ad-hoc JSON dumps, debug prints), not the durable state. The delta type rename does change the `CardanoDelta` enum's serde encoding (variant tag is the type name), so any node mid-sync with a populated WAL will need to wipe / re-bootstrap. Fresh nodes are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Actionable comments posted: 8
🧹 Nitpick comments (1)
crates/cardano/src/ewrap/loading.rs (1)
159-165: Acknowledge the ordering constraint; consider tracking the TODO.

The comment clearly documents the rewards-before-drops ordering requirement and its rationale. The TODO to move retires to ESTART would eliminate this ordering hack.
Would you like me to open an issue to track this refactoring task (moving retires to ESTART after snapshot)?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/cardano/src/ewrap/loading.rs` around lines 159 - 165, Acknowledge the rewards-before-drops ordering constraint and record the TODO by opening a tracking issue: create an issue titled something like "Refactor: move retires to ESTART to remove rewards-before-drops ordering hack" referencing the comment in crates/cardano/src/ewrap/loading.rs, summarizing the current hack (rewards must apply before drops because refunds clone live values pre-snapshot), and add acceptance criteria that retires are moved to ESTART and the ordering hack/comment removed; include links to the affected code and label it as a refactor/tech-debt item for prioritization.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/cardano/src/estart/work_unit.rs`:
- Around line 91-120: The initialize method only restores estart_progress.total
but not the committed shard index, so on restart shards will be replayed from 0
and non-idempotent AccountTransition will double-rotate accounts; update
initialize (in work_unit.rs) to read estart_progress.committed (if present) and
set the work-unit's starting shard/next-shard field accordingly (fall back to 0
when absent), and also update the shard iteration in the core runner (the loop
in crates/core/src/sync.rs that currently starts at 0) to begin at that restored
committed index so earlier shards are skipped or treated as already applied
before building/applying account deltas; reference symbols: initialize,
estart_progress.committed, EStartProgress, and the shard iteration in
crates/core/src/sync.rs.
In `@crates/cardano/src/ewrap/commit.rs`:
- Around line 169-198: stream_and_apply_namespace::<D, EpochState> applies the
final EpochState into the writer but does not update
BoundaryWork.ending_state(), so archive_writer.write_log_typed is archiving a
stale self.ending_state(); capture the applied EpochState from the stream path
(e.g. return or output value of stream_and_apply_namespace for EpochState) or
reload the EpochState from the writer/namespace after streaming, then either
assign that value into self.ending_state() or pass that captured EpochState
directly to archive_writer.write_log_typed instead of using self.ending_state()
so the archived record matches the live, post-EpochWrapUp state.
In `@crates/cardano/src/shard.rs`:
- Around line 66-74: The function shard_key_ranges currently uses debug_assert!
for validate_total_shards and shard_index checks which are removed in release
builds, causing division by zero or incorrect partitioning (PREFIX_SPACE /
total_shards) at runtime; change the function to return a Result or use
assert!/explicit validation so invalid total_shards or shard_index produce a
clear error. Specifically, replace the debug_assert! calls in shard_key_ranges
with a runtime check that calls validate_total_shards(total_shards) and returns
Err(...) on failure (or panics with assert! including the failing value), and
also validate shard_index < total_shards before calling variant_range; ensure
callers (e.g., code that reads ShardProgress.total) handle the Result if you
choose the Result approach.
In `@crates/cardano/work_units.md`:
- Around line 5-16: The fenced block containing the sequence diagram starting
with "Estart → Roll … → Rupd → Roll … → Ewrap" should include a language
tag to satisfy markdownlint MD040; update the opening fence from ``` to ```text
so the block is treated as plain text (leave the diagram content unchanged).
- Around line 141-143: Update the documentation to reference the new loader
name: replace the obsolete mention of BoundaryWork::load_ewrap with
BoundaryWork::load_finalize; locate the ewrap finalize docs that currently
describe "Builds a fresh `BoundaryWork` via `BoundaryWork::load_ewrap`" and
change that text to point to `BoundaryWork::load_finalize`, ensuring any
surrounding description or links reflect the new API name used in the
implementation (e.g., the loader invoked in the ewrap work unit).
In `@crates/core/src/work_unit.rs`:
- Around line 109-118: total_shards() must never return 0: change its return
type from u32 to core::num::NonZeroU32 (or alternatively enforce a runtime check
in the executor before running phases) so that a work unit cannot skip all
per-shard phases; update implementations that derive shard count in initialize()
to store and return a NonZeroU32 cached value, and update any call sites
(executor logic that iterates
load/compute/commit_wal/commit_state/commit_archive/commit_indexes) to accept
and unwrap the NonZeroU32 safely; if you prefer runtime validation instead, add
an explicit check in the executor (before running load/compute/commit_* but
after initialize()) that rejects total_shards() == 0 with a clear error.
In `@tests/bootstrap.rs`:
- Around line 40-49: The test helper currently calls
WorkUnit::<ToyDomain>::finalize(...) even though commit_archive() and
commit_indexes() are intentionally skipped, so it no longer models the "crash
after state commit" boundary; remove the finalize() call (or guard it behind a
flag) so the helper stops after WorkUnit::<ToyDomain>::commit_state(...),
leaving commit_archive() and commit_indexes() unexecuted to accurately simulate
the crash-after-state scenario and allow recovery paths to be tested.
In `@tests/memory.rs`:
- Around line 141-163: The test currently only samples Region::change() before
iterator use, so update it to also sample after consuming the iterator: call let
stats_before = reg.change() (or reuse heap_delta), consume the iterator with let
count = iter.count(), then call let stats_after = reg.change() and compute let
heap_delta_consumption = stats_after.bytes_allocated -
stats_before.bytes_allocated; finally assert that heap_delta_consumption is
below the same threshold (or a separate reasonable threshold) to ensure
iter_entities + iteration is heap-bounded. Reference the existing iter_entities,
iter.count(), Region::change(), and heap_delta/threshold variables when adding
the second assertion.
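The before/after sampling pattern the comment describes can be sketched with a counting global allocator standing in for the test's `Region` type (which is not shown in this review). Everything below — the `Counting` allocator, `bytes_allocated`, and the thresholds — is illustrative scaffolding, not the actual `tests/memory.rs` code.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal stand-in for Region::change(): a global allocator that counts
// bytes allocated, so heap growth can be sampled before and after
// consuming an iterator.
struct Counting;
static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static A: Counting = Counting;

fn bytes_allocated() -> usize {
    ALLOCATED.load(Ordering::Relaxed)
}
```

The point of the second sample: a truly streaming iterator allocates nothing while being consumed, so the delta measured across `iter.count()` (not just across iterator construction) is what proves the O(1)-heap property.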
---
Nitpick comments:
In `@crates/cardano/src/ewrap/loading.rs`:
- Around line 159-165: Acknowledge the rewards-before-drops ordering constraint
and record the TODO by opening a tracking issue: create an issue titled
something like "Refactor: move retires to ESTART to remove rewards-before-drops
ordering hack" referencing the comment in crates/cardano/src/ewrap/loading.rs,
summarizing the current hack (rewards must apply before drops because refunds
clone live values pre-snapshot), and add acceptance criteria that retires are
moved to ESTART and the ordering hack/comment removed; include links to the
affected code and label it as a refactor/tech-debt item for prioritization.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ef9f2a57-98e9-488b-a6d0-c5240e79109d
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (32)
- crates/cardano/src/estart/commit.rs
- crates/cardano/src/estart/loading.rs
- crates/cardano/src/estart/mod.rs
- crates/cardano/src/estart/reset.rs
- crates/cardano/src/estart/work_unit.rs
- crates/cardano/src/ewrap/commit.rs
- crates/cardano/src/ewrap/loading.rs
- crates/cardano/src/ewrap/mod.rs
- crates/cardano/src/ewrap/refunds.rs
- crates/cardano/src/ewrap/rewards.rs
- crates/cardano/src/ewrap/work_unit.rs
- crates/cardano/src/ewrap/wrapup.rs
- crates/cardano/src/genesis/mod.rs
- crates/cardano/src/genesis/work_unit.rs
- crates/cardano/src/lib.rs
- crates/cardano/src/model/epochs.rs
- crates/cardano/src/model/mod.rs
- crates/cardano/src/model/pending.rs
- crates/cardano/src/roll/work_unit.rs
- crates/cardano/src/rupd/work_unit.rs
- crates/cardano/src/shard.rs
- crates/cardano/src/work.rs
- crates/cardano/work_units.md
- crates/core/Cargo.toml
- crates/core/src/config.rs
- crates/core/src/import.rs
- crates/core/src/sync.rs
- crates/core/src/work_unit.rs
- crates/testing/src/harness/cardano.rs
- tests/bootstrap.rs
- tests/epoch_pots/main.rs
- tests/memory.rs
✅ Files skipped from review due to trivial changes (5)
- crates/cardano/src/estart/mod.rs
- crates/cardano/src/ewrap/refunds.rs
- crates/core/Cargo.toml
- crates/cardano/src/ewrap/rewards.rs
- crates/cardano/src/model/pending.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/core/src/config.rs
```rust
/// Number of shards this work unit splits into.
///
/// The executor calls each per-shard phase `total_shards()` times.
/// Defaults to `1` for non-sharded work units. The returned value
/// must be valid after `initialize()` has run — implementations that
/// derive the count from persisted state should compute it inside
/// `initialize()` and cache it on `self`.
fn total_shards(&self) -> u32 {
    1
}
```
Make zero shards impossible.
total_shards() == 0 now means no load/compute/commit_* phase runs, yet the work unit can still proceed to finalize() and post-work hooks. One bad implementation can therefore "complete" without processing any shard. Please encode this as NonZeroU32 or reject zero before execution.
Based on learnings: Work units function as mini-ETL jobs that extract data from storage, transform it using chain-specific logic, and load results into appropriate stores following the sequence: load() → compute() → commit_wal() → commit_state() → commit_archive() → commit_indexes().
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/core/src/work_unit.rs` around lines 109 - 118, total_shards() must
never return 0: change its return type from u32 to core::num::NonZeroU32 (or
alternatively enforce a runtime check in the executor before running phases) so
that a work unit cannot skip all per-shard phases; update implementations that
derive shard count in initialize() to store and return a NonZeroU32 cached
value, and update any call sites (executor logic that iterates
load/compute/commit_wal/commit_state/commit_archive/commit_indexes) to accept
and unwrap the NonZeroU32 safely; if you prefer runtime validation instead, add
an explicit check in the executor (before running load/compute/commit_* but
after initialize()) that rejects total_shards() == 0 with a clear error.
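A minimal sketch of the `NonZeroU32` variant of this fix. The trait and type names below are illustrative; the real `WorkUnit` trait has more phases and a domain parameter.

```rust
use std::num::NonZeroU32;

// Sketch: encode "at least one shard" in the type, as the review suggests.
// With NonZeroU32, an implementation cannot report zero shards, so the
// executor's per-shard loop is guaranteed to run at least once.
trait ShardedWork {
    fn total_shards(&self) -> NonZeroU32 {
        NonZeroU32::MIN // 1: default for non-sharded work units
    }
}

struct DefaultUnit;
impl ShardedWork for DefaultUnit {}

struct ManyShards;
impl ShardedWork for ManyShards {
    fn total_shards(&self) -> NonZeroU32 {
        // NonZeroU32::new returns None for 0, forcing the invalid count
        // to be handled (or ruled out) at construction time instead of
        // silently skipping every shard phase.
        NonZeroU32::new(16).expect("16 is non-zero")
    }
}
```

The alternative the comment offers — a runtime check in the executor after `initialize()` — trades the type-level guarantee for an unchanged trait signature.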
RUPD was the boundary path's worst memory spike: a single load built the full per-account `accounts_by_pool` + `registered_accounts` plus an O(N) `RewardMap`. Mirror the Estart/Ewrap pattern — hoist pool-bounded globals into `initialize()` (pots, incentives, pparams, pool snapshots, pool_stake totals) and shard the per-credential leg across the same `account_shards` partitions: each shard streams `AccountState` over its two key ranges only, builds a shard-scoped delegator + registered set, runs `define_rewards` over every pool but emits only in-range credentials, and writes the in-range `PendingRewardState` entities.

Leader-reward emission gates on a new default-true `RewardsContext::should_include`, so the shard whose range contains the operator credential is the sole emitter for that pool's leader reward. Delegator emissions are filtered naturally via `pool_delegators` returning only in-range creds, with a defensive `should_include` check at the merge site.

Per-shard progress is tracked by a new `RupdProgress` delta on `EpochState.rupd_progress`, with the same idempotency + ordering + total-mismatch guards as `EWrapProgress` / `EStartProgress`. `EpochTransition` rollback now also captures and clears `rupd_progress`.

`EpochState.incentives` and per-pool `StakeLog` archive entries move to `finalize()` (one-shot, after every shard has committed) so concurrent shard commits can't race; the per-pool reward + delegator-count contributions accumulate on the work unit as O(pools) state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
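The leader-reward gating described in this commit message can be sketched as follows. The shard function, the `Option`-based "default-true" gate, and the shard count are illustrative assumptions (a 256-value first-byte prefix space split into 16 buckets), not the actual dolos types.

```rust
// Sketch: only the shard whose key range contains the pool operator's
// credential emits that pool's leader reward, so each reward is emitted
// exactly once across all shards.
const TOTAL_SHARDS: u16 = 16; // assumed shard count for illustration

fn shard_of(credential_first_byte: u8) -> u16 {
    // 256 / 16 = 16 first-byte values per shard
    credential_first_byte as u16 / (256 / TOTAL_SHARDS)
}

struct RewardsContext {
    shard_index: Option<u16>,
}

impl RewardsContext {
    // Default-true gate: a non-sharded context (None) includes everything;
    // a sharded context includes only credentials in its own range.
    fn should_include(&self, credential_first_byte: u8) -> bool {
        match self.shard_index {
            None => true,
            Some(i) => shard_of(credential_first_byte) == i,
        }
    }
}
```

Running all shards against one operator credential shows exactly one shard passing the gate, which is the uniqueness property the commit relies on.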
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…constant Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Actionable comments posted: 4
♻️ Duplicate comments (3)
crates/core/src/work_unit.rs (1)
60-69: ⚠️ Potential issue | 🟠 Major

Reject `0` shards in the trait contract.

A `u32` return still lets an implementation report `0` after `initialize()`, which means the executor can skip every `load`/`compute`/`commit_*` phase and jump straight to `finalize()`. Make this `NonZeroU32` or add an explicit runtime check in the runner.

Based on learnings: Work units function as mini-ETL jobs that extract data from storage, transform it using chain-specific logic, and load results into appropriate stores following the sequence: load() → compute() → commit_wal() → commit_state() → commit_archive() → commit_indexes().
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/core/src/work_unit.rs` around lines 60 - 69, The trait currently allows total_shards() to return 0 which skips all shard phases; fix by making the contract non-zero: change the return type of total_shards() to std::num::NonZeroU32 (update implementations to cache a NonZeroU32 in initialize()), or if you prefer not to change the trait signature, add an explicit runtime validation in the runner/executor (after calling initialize() and reading total_shards()) that checks shards != 0 and returns an Err or panics with a clear message; refer to the total_shards() method and initialize() call sites in the executor/runner that loop over shards to add the check.

crates/cardano/src/estart/work_unit.rs (1)
93-106: ⚠️ Potential issue | 🔴 Critical

Resume from `estart_progress.committed`, and don't mask epoch read failures.

`initialize()` still restores only `estart_progress.total`. After a crash, the executor will rerun shards `0..k-1` unless it also resumes from `committed`, and this file already documents `AccountTransition` as non-idempotent. Falling back to `ACCOUNT_SHARDS` on `load_epoch` errors makes that restart path even less safe by silently repartitioning the boundary.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/cardano/src/estart/work_unit.rs` around lines 93 - 106, The initialize() logic must resume from estart_progress.committed and must not silently mask load_epoch failures; instead of unconditionally falling back to ACCOUNT_SHARDS on Err(_), propagate the error and when load_epoch succeeds, set self.total_shards from epoch.estart_progress.as_ref().map(|p| p.total).unwrap_or(ACCOUNT_SHARDS) and also set the worker's resume point from epoch.estart_progress.as_ref().and_then(|p| p.committed) (e.g., assign to whatever field tracks the next shard to run), so shards restart from the committed position; remove the Err(_) => ACCOUNT_SHARDS branch and return the DomainError on read failure.

crates/cardano/src/shard.rs (1)
76-79: ⚠️ Potential issue | 🟠 Major

Validate shard counts in release builds too.

These guards disappear in release, but `total_shards` now comes from persisted `ShardProgress.total` as well as the constant. A bad value can still panic on `PREFIX_SPACE / total_shards` or produce broken partitions at runtime, so this needs a real `assert!` or `Result` path instead of `debug_assert!`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/cardano/src/shard.rs` around lines 76 - 79, The debug-only checks in shard_key_ranges are unsafe; replace the debug_assert! calls with real runtime validation: call validate_total_shards(total_shards) and return/propagate an error on failure (or use assert! if you prefer panicking) and explicitly check shard_index < total_shards, returning Err for invalid inputs; ensure callers that read persisted ShardProgress.total handle the Result. Update shard_key_ranges (and its callers) to use the new Result-returning signature or use assert! so that division by PREFIX_SPACE / total_shards and partitioning logic cannot run with invalid total_shards.
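A hedged sketch of the `Result`-returning validation this comment asks for. It assumes a 256-value first-byte prefix space and a simple half-open range representation; the actual `PREFIX_SPACE`, range types, and signatures in `crates/cardano/src/shard.rs` may differ.

```rust
// Sketch of shard_key_ranges with runtime validation instead of
// debug_assert!, as the review suggests. PREFIX_SPACE = 256 (one bucket
// per possible first key byte) is an assumption for illustration.
const PREFIX_SPACE: u32 = 256;

fn validate_total_shards(total_shards: u32) -> Result<(), String> {
    if total_shards == 0 || PREFIX_SPACE % total_shards != 0 {
        return Err(format!(
            "total_shards must be a non-zero divisor of {PREFIX_SPACE}, got {total_shards}"
        ));
    }
    Ok(())
}

/// Returns the half-open first-byte range [start, end) owned by `shard_index`.
fn shard_key_range(shard_index: u32, total_shards: u32) -> Result<(u32, u32), String> {
    validate_total_shards(total_shards)?;
    if shard_index >= total_shards {
        return Err(format!(
            "shard_index {shard_index} out of range 0..{total_shards}"
        ));
    }
    // Safe: total_shards validated non-zero above, even in release builds.
    let width = PREFIX_SPACE / total_shards;
    Ok((shard_index * width, (shard_index + 1) * width))
}
```

Unlike `debug_assert!`, these checks survive release builds, so a corrupt persisted `total` surfaces as a clear error rather than a division panic or a silently broken partition.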
🧹 Nitpick comments (2)
crates/cardano/src/model/epochs.rs (1)
780-783: The `expect()` on `entity.end` remains fragile.

Per the past review comment, this `expect()` relies on Genesis seeding `end = Some(EndStats::default())` and `EpochTransition` re-seeding it on every subsequent transition. While the comment was marked as addressed, the code still uses `expect()` rather than the suggested `get_or_insert_with(EndStats::default)` approach, which would be more defensive.

This is low-risk since the invariant is maintained by the current code paths, but consider the defensive approach for robustness against future regressions.
🛡️ Optional defensive change
```diff
- let end = entity
-     .end
-     .as_mut()
-     .expect("ESTART seeded EpochState.end before shards run");
+ let end = entity.end.get_or_insert_with(EndStats::default);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/cardano/src/model/epochs.rs` around lines 780 - 783, Replace the fragile expect() on entity.end with a defensive get_or_insert_with to ensure EndStats is initialized if missing; specifically, change the code path in the epochs handling where entity.end is accessed (currently using .end.as_mut().expect("ESTART seeded EpochState.end before shards run")) to call .end.get_or_insert_with(EndStats::default) or equivalent so the EndStats default is guaranteed without panicking, leaving the rest of the logic in the EpochTransition/epochs handling unchanged.

crates/cardano/src/lib.rs (1)

349-448: Consider adding crash-recovery check for `rupd_progress`.

The initialization checks `ewrap_progress` and `estart_progress` for crash-recovery scenarios, but `rupd_progress` is not checked. Since RUPD also uses the same sharded pattern with `RupdProgress` deltas, operators would benefit from similar logging if a crash occurred mid-RUPD.

📝 Suggested addition after line 447

```rust
// Same crash-recovery check for the RUPD phase.
if let Some(progress) = epoch.rupd_progress.as_ref() {
    let configured = crate::shard::ACCOUNT_SHARDS;
    if progress.total != configured {
        tracing::warn!(
            epoch = epoch.number,
            stored_total = progress.total,
            configured_total = configured,
            "in-flight RUPD uses {} shards but ACCOUNT_SHARDS = {}; \
             the in-flight RUPD will continue with {} (the persisted total)",
            progress.total,
            configured,
            progress.total,
        );
    }
    if progress.committed < progress.total {
        tracing::warn!(
            epoch = epoch.number,
            next_shard = progress.committed,
            total_shards = progress.total,
            "crash detected mid-RUPD: rupd_progress is set. \
             On the next RUPD trigger, dolos will resume the pipeline."
        );
    }
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/cardano/src/lib.rs` around lines 349 - 448, Add the same crash-recovery logging for rupd_progress that exists for ewrap_progress and estart_progress: after loading epoch (load_epoch::<D> and the epoch variable), check epoch.rupd_progress.as_ref(), compare progress.total against crate::shard::ACCOUNT_SHARDS and emit a tracing::warn with epoch, stored_total, configured_total when they differ, and if progress.committed < progress.total emit a tracing::warn including next_shard and total_shards warning that a crash was detected mid-RUPD; mirror the style and fields used in the existing ewrap/estart blocks (use progress, progress.total, progress.committed, and epoch.number).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/cardano/src/ewrap/work_unit.rs`:
- Around line 71-84: The code in initialize uses load_epoch::<D>(domain.state())
and silently treats any Err as ACCOUNT_SHARDS, breaking the “persisted shard
count is authoritative” guarantee; change the Err(_) arm to propagate the error
instead of falling back. Replace the match so load_epoch errors are returned
(e.g., use let epoch = load_epoch::<D>(domain.state())?; or map_err into
DomainError and return Err) and then set self.total_shards from epoch. Keep the
existing Ok(epoch) branch logic (use epoch.ewrap_progress.as_ref().map(|p|
p.total).unwrap_or(ACCOUNT_SHARDS)) but do not convert load_epoch failures into
ACCOUNT_SHARDS.
In `@crates/cardano/src/genesis/mod.rs`:
- Around line 80-81: Fix the typo in the comment that currently reads "Ewrap
(which now runs before Ewrap)" in the seed `end` comment block in
genesis/mod.rs; update the text to reference the correct visitor (e.g.,
"AccountShard (which now runs before Ewrap)") so the comment correctly explains
that per-account shards run before the global Ewrap visitor, leaving the rest of
the sentence unchanged.
In `@crates/cardano/src/rupd/work_unit.rs`:
- Around line 380-391: Add an inline comment next to the hardcoded live_pledge:
0 in the StakeLog construction explaining why it's set to zero in this RUPD path
(e.g., "live_pledge is not computed during RUPD updates; computed in rewards
module" or note it's a known limitation), so future readers understand this is
intentional; update the comment adjacent to the StakeLog instantiation (where
live_pledge is assigned) and reference the rewards computation location or TODO
if it should be implemented later.
In `@crates/cardano/src/shard.rs`:
- Around line 46-49: The const assert using the modulo check should use the
const-friendly method: replace the predicate `PREFIX_SPACE % ACCOUNT_SHARDS ==
0` with `PREFIX_SPACE.is_multiple_of(ACCOUNT_SHARDS)` in the const assert that
references ACCOUNT_SHARDS and PREFIX_SPACE (the const block with the assert at
the top of shard.rs); update the assert to read that ACCOUNT_SHARDS >= 1 &&
PREFIX_SPACE.is_multiple_of(ACCOUNT_SHARDS) to silence the clippy
manual_is_multiple_of warning and keep semantics identical.
---
Duplicate comments:
In `@crates/cardano/src/estart/work_unit.rs`:
- Around line 93-106: The initialize() logic must resume from
estart_progress.committed and must not silently mask load_epoch failures;
instead of unconditionally falling back to ACCOUNT_SHARDS on Err(_), propagate
the error and when load_epoch succeeds, set self.total_shards from
epoch.estart_progress.as_ref().map(|p| p.total).unwrap_or(ACCOUNT_SHARDS) and
also set the worker's resume point from
epoch.estart_progress.as_ref().and_then(|p| p.committed) (e.g., assign to
whatever field tracks the next shard to run), so shards restart from the
committed position; remove the Err(_) => ACCOUNT_SHARDS branch and return the
DomainError on read failure.
In `@crates/cardano/src/shard.rs`:
- Around line 76-79: The debug-only checks in shard_key_ranges are unsafe;
replace the debug_assert! calls with real runtime validation: call
validate_total_shards(total_shards) and return/propagate an error on failure (or
use assert! if you prefer panicking) and explicitly check shard_index <
total_shards, returning Err for invalid inputs; ensure callers that read
persisted ShardProgress.total handle the Result. Update shard_key_ranges (and
its callers) to use the new Result-returning signature or use assert! so that
division by PREFIX_SPACE / total_shards and partitioning logic cannot run with
invalid total_shards.
In `@crates/core/src/work_unit.rs`:
- Around line 60-69: The trait currently allows total_shards() to return 0 which
skips all shard phases; fix by making the contract non-zero: change the return
type of total_shards() to std::num::NonZeroU32 (update implementations to cache
a NonZeroU32 in initialize()), or if you prefer not to change the trait
signature, add an explicit runtime validation in the runner/executor (after
calling initialize() and reading total_shards()) that checks shards != 0 and
returns an Err or panics with a clear message; refer to the total_shards()
method and initialize() call sites in the executor/runner that loop over shards
to add the check.
---
Nitpick comments:
In `@crates/cardano/src/lib.rs`:
- Around line 349-448: Add the same crash-recovery logging for rupd_progress
that exists for ewrap_progress and estart_progress: after loading epoch
(load_epoch::<D> and the epoch variable), check epoch.rupd_progress.as_ref(),
compare progress.total against crate::shard::ACCOUNT_SHARDS and emit a
tracing::warn with epoch, stored_total, configured_total when they differ, and
if progress.committed < progress.total emit a tracing::warn including next_shard
and total_shards warning that a crash was detected mid-RUPD; mirror the style
and fields used in the existing ewrap/estart blocks (use progress,
progress.total, progress.committed, and epoch.number).
In `@crates/cardano/src/model/epochs.rs`:
- Around line 780-783: Replace the fragile expect() on entity.end with a
defensive get_or_insert_with to ensure EndStats is initialized if missing;
specifically, change the code path in the epochs handling where entity.end is
accessed (currently using .end.as_mut().expect("ESTART seeded EpochState.end
before shards run")) to call .end.get_or_insert_with(EndStats::default) or
equivalent so the EndStats default is guaranteed without panicking, leaving the
rest of the logic in the EpochTransition/epochs handling unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c85cdd15-e7d9-42c1-8e53-7e98f8da91d2
📒 Files selected for processing (14)
- crates/cardano/src/estart/work_unit.rs
- crates/cardano/src/ewrap/work_unit.rs
- crates/cardano/src/genesis/mod.rs
- crates/cardano/src/lib.rs
- crates/cardano/src/model/epochs.rs
- crates/cardano/src/model/mod.rs
- crates/cardano/src/rewards/mod.rs
- crates/cardano/src/rupd/loading.rs
- crates/cardano/src/rupd/mod.rs
- crates/cardano/src/rupd/work_unit.rs
- crates/cardano/src/shard.rs
- crates/cardano/work_units.md
- crates/core/src/sync.rs
- crates/core/src/work_unit.rs
```rust
let log = StakeLog {
    blocks_minted,
    total_stake: pool_stake,
    relative_size,
    live_pledge: 0,
    declared_pledge,
    delegators_count: share.delegators_count,
    total_rewards: share.total_rewards,
    operator_share: share.operator_share,
    fixed_cost,
    margin_cost,
};
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Check if live_pledge is populated elsewhere or if this is expected
rg -n "live_pledge" crates/cardano/src/rupd/ -A 2 -B 2
rg -n "live_pledge" crates/cardano/src/model/logs.rs -A 2 -B 2
```

Repository: txpipe/dolos
Length of output: 574

🏁 Script executed:

```shell
# Search for all StakeLog creations to understand if this is the only place
rg -n "StakeLog\s*\{" --type rust -A 15
```

Repository: txpipe/dolos
Length of output: 3390

🏁 Script executed:

```shell
# Check for TODO/FIXME comments related to pledge or StakeLog
rg -n "(TODO|FIXME).*?(pledge|live|stake)" crates/cardano/src/ -i
```

Repository: txpipe/dolos
Length of output: 38

🏁 Script executed:

```shell
# Check if StakeLog is used elsewhere and how live_pledge is typically populated
rg -n "live_pledge" crates/cardano/src/ -B 3 -A 1 --type rust
```

Repository: txpipe/dolos
Length of output: 5777
Add a comment explaining why live_pledge is hardcoded to 0.
The StakeLog entry sets live_pledge: 0 in the RUPD context. While live_pledge is computed elsewhere in the codebase (in the rewards module), RUPD appears to intentionally omit this computation. Add an inline comment to clarify whether this is by design (e.g., "live pledge is not computed in RUPD updates") or if it's a known limitation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/cardano/src/rupd/work_unit.rs` around lines 380 - 391, Add an inline
comment next to the hardcoded live_pledge: 0 in the StakeLog construction
explaining why it's set to zero in this RUPD path (e.g., "live_pledge is not
computed during RUPD updates; computed in rewards module" or note it's a known
limitation), so future readers understand this is intentional; update the
comment adjacent to the StakeLog instantiation (where live_pledge is assigned)
and reference the rewards computation location or TODO if it should be
implemented later.
Bincode encodes CardanoDelta variants by positional index and structs by positional fields, so this branch's three new variants (EWrapProgress, EStartProgress, RupdProgress) inserted mid-enum and the appended `prev_*_progress` fields on EpochWrapUp / EpochTransition would have made pre-upgrade WAL rows undecodable.

Freezes CardanoDelta indices 0..=38 to match pre-PR `main`, restores the legacy struct shapes verbatim under `#[deprecated]`, and introduces EpochWrapUpV2 / EpochTransitionV2 carrying the new undo state. The three sharded-progress variants plus the V2 variants are appended at the end of the enum.

New commit paths in estart/reset.rs and ewrap/wrapup.rs emit V2; the legacy types remain solely for replay of older WAL rows and carry TODO(wal-compat) cleanup notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
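The positional-index hazard can be illustrated without bincode itself: a positional encoding stores only the variant's index, so if variants shift, old rows decode as the wrong variant. The toy enums below are illustrative, not the real CardanoDelta.

```rust
// Toy illustration of why inserting enum variants mid-enum breaks a
// positional (index-based) encoding like bincode's.
#[derive(Debug, PartialEq, Clone, Copy)]
enum OldDelta {
    RewardApplied, // index 0
    EpochWrapUp,   // index 1
}

#[derive(Debug, PartialEq, Clone, Copy)]
enum NewDeltaBroken {
    RewardApplied, // index 0
    EWrapProgress, // index 1 -- inserted mid-enum: shifts everything below
    EpochWrapUp,   // index 2
}

// A positional decoder only sees the stored index, never the variant name.
fn decode_broken(index: u32) -> Option<NewDeltaBroken> {
    match index {
        0 => Some(NewDeltaBroken::RewardApplied),
        1 => Some(NewDeltaBroken::EWrapProgress),
        2 => Some(NewDeltaBroken::EpochWrapUp),
        _ => None,
    }
}
```

A pre-upgrade WAL row that encoded `OldDelta::EpochWrapUp` stored index 1; decoding it against the reordered enum yields `EWrapProgress` instead. Appending new variants at the end of the enum, as this commit does, keeps every frozen index stable.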
🧹 Nitpick comments (2)
crates/cardano/src/model/epochs.rs (2)
1490-1497: Remove unused `#[allow(deprecated)]` attribute.

Static analysis correctly identifies this attribute as unused. Outer attributes on `prop_compose!` macro invocations don't propagate into the macro expansion. The test function at line 1660 that calls `any_epoch_wrap_up()` already has its own `#[allow(deprecated)]`, which is where the suppression actually takes effect.

🧹 Suggested fix

```diff
- #[allow(deprecated)]
  prop_compose! {
      fn any_epoch_wrap_up()(
          stats in any_end_stats(),
      ) -> EpochWrapUp {
          EpochWrapUp::new(stats)
      }
  }
```

As per coding guidelines: "Run `cargo clippy --workspace --all-targets --all-features` and resolve all clippy warnings before committing changes".
Verify each finding against the current code and only fix it if needed. In `@crates/cardano/src/model/epochs.rs` around lines 1490 - 1497, Remove the unused outer attribute by deleting the #[allow(deprecated)] placed immediately above the prop_compose! invocation that defines any_epoch_wrap_up; the suppression does not propagate into the macro expansion and the test calling any_epoch_wrap_up already has its own #[allow(deprecated)], so keep that test attribute and remove the one before prop_compose! (symbols to locate: prop_compose!, any_epoch_wrap_up, any_end_stats, EpochWrapUp::new).
1603-1613: Remove unused `#[allow(deprecated)]` attribute.

Same issue as line 1490 — this outer attribute on the macro invocation is ineffective. The test function at line 1685 already has the necessary `#[allow(deprecated)]`.

🧹 Suggested fix

```diff
- #[allow(deprecated)]
  prop_compose! {
      fn any_epoch_transition()(
          new_epoch in root::any_epoch(),
      ) -> EpochTransition {
          // new_pots is filled in by the test harness from the entity's initial_pots
          // so that `new_pots.max_supply() == entity.initial_pots.max_supply()` holds
          // (which `apply`'s debug_assert requires).
          EpochTransition::new(new_epoch, crate::pots::Pots::default(), None, None)
      }
  }
```

As per coding guidelines: "Run `cargo clippy --workspace --all-targets --all-features` and resolve all clippy warnings before committing changes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/cardano/src/model/epochs.rs` around lines 1603 - 1613, Remove the ineffective outer attribute #[allow(deprecated)] applied to the prop_compose! macro invocation that defines any_epoch_transition; simply delete that attribute so the macro block starts with prop_compose! { ... } and leave the rest (new_epoch in root::any_epoch(), EpochTransition::new(...)) unchanged — the test function already carries the needed #[allow(deprecated)] where necessary.
📒 Files selected for processing (4)
- crates/cardano/src/estart/reset.rs
- crates/cardano/src/ewrap/wrapup.rs
- crates/cardano/src/model/epochs.rs
- crates/cardano/src/model/mod.rs
🚧 Files skipped from review as they are similar to previous changes (3)
- crates/cardano/src/estart/reset.rs
- crates/cardano/src/model/mod.rs
- crates/cardano/src/ewrap/wrapup.rs
Previously, `EwrapWorkUnit` / `EstartWorkUnit` / `RupdWorkUnit::initialize` read only `*_progress.total` from persisted state and the core lifecycle loop iterated `0..total_shards` unconditionally, so a crash mid-boundary replayed shards `0..k-1` on restart. That is unsafe: `AccountTransition` deltas (Estart) are non-idempotent — replaying a committed shard would double-rotate every account in it. Adds `WorkUnit::start_shard()` (defaulting to `0`); `run_lifecycle` now loops `start_shard..total_shards`. The three sharded work units cache the committed cursor in `initialize` from `*_progress.committed`, and their `start_shard()` returns it. `CardanoWorkUnit` delegates the new method to its variants. The bootstrap test harness mirrors the real runner. Also propagates `load_epoch` failures via `?` instead of the previous `Err(_) => ACCOUNT_SHARDS` swallow, so a state-read failure can no longer silently repartition an in-flight boundary. Addresses PR #978 review comments 3153861491 (restart cursor) and 3154574387 (propagate load_epoch failures). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
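The resume-cursor shape described above can be sketched in miniature. All types here are made up for illustration — the real `WorkUnit` trait and `run_lifecycle` live in dolos's core crate and carry more lifecycle phases:

```rust
// Sketch: a WorkUnit exposes a start cursor (default 0) so the lifecycle
// loop never replays shards that already committed before a crash.

struct ShardProgress {
    committed: u32, // shards durably committed before the restart
    total: u32,     // shard count fixed when the boundary started
}

trait WorkUnit {
    fn total_shards(&self) -> u32;
    /// First shard the lifecycle should run. Defaults to 0; sharded units
    /// override it with the cursor cached from persisted `*_progress.committed`.
    fn start_shard(&self) -> u32 {
        0
    }
    fn run_shard(&mut self, shard: u32);
    fn finalize(&mut self);
}

fn run_lifecycle<W: WorkUnit>(unit: &mut W) {
    // Resuming at start_shard (not 0) is what makes a mid-boundary crash
    // safe: non-idempotent deltas in committed shards are never reapplied.
    for shard in unit.start_shard()..unit.total_shards() {
        unit.run_shard(shard);
    }
    unit.finalize();
}

struct Sharded {
    progress: ShardProgress,
    ran: Vec<u32>,
    finalized: bool,
}

impl WorkUnit for Sharded {
    fn total_shards(&self) -> u32 {
        self.progress.total
    }
    fn start_shard(&self) -> u32 {
        self.progress.committed
    }
    fn run_shard(&mut self, shard: u32) {
        self.ran.push(shard);
    }
    fn finalize(&mut self) {
        self.finalized = true;
    }
}

fn main() {
    // Model a crash after shards 0..=2 committed, out of 16.
    let mut unit = Sharded {
        progress: ShardProgress { committed: 3, total: 16 },
        ran: vec![],
        finalized: false,
    };
    run_lifecycle(&mut unit);
    assert_eq!(unit.ran.first(), Some(&3)); // shards 0..3 were not replayed
    assert_eq!(unit.ran.len(), 13);
    assert!(unit.finalized);
    println!("resumed at shard 3");
}
```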
…lize `commit_finalize` was archiving `self.ending_state()` at the epoch-start temporal key, but `stream_and_apply_namespace::<D, EpochState>` only applied the boundary-closing deltas (PParamsUpdate, TreasuryWithdrawal, EpochWrapUp) to the writer — `ending_state` itself was never refreshed. The archived row therefore carried the pre-commit snapshot (stale rolling/pparams, populated `ewrap_progress`). Adds an EpochState-specific variant of the streaming helper that returns the post-apply singleton; `commit_finalize` swaps the result into `self.ending_state` before the archive write, so the archived EpochState matches what's about to be committed to the live state store. Addresses PR #978 review comment 3153861497. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
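The bug-and-fix shape can be sketched with stand-in types (nothing here is the real dolos API; the delta is modeled as a plain function pointer):

```rust
// Sketch: deltas are applied to the store ("writer"), but the cached
// in-memory snapshot that gets archived must be refreshed from the
// post-apply result, or the archive carries a stale pre-commit state.

#[derive(Clone, Debug, PartialEq)]
struct EpochState {
    rolling: u64,
    ewrap_progress: Option<u32>,
}

// The fixed helper shape: apply the deltas to the writer *and* return the
// post-apply singleton so the caller can swap it into its cached copy.
fn stream_and_apply(writer: &mut EpochState, deltas: &[fn(&mut EpochState)]) -> EpochState {
    for d in deltas {
        d(writer);
    }
    writer.clone()
}

fn main() {
    let mut writer = EpochState { rolling: 1, ewrap_progress: Some(7) };
    let stale_snapshot = writer.clone(); // what the old code archived

    let wrap_up: fn(&mut EpochState) = |s| {
        s.rolling += 1;          // transition the rolling snapshot
        s.ewrap_progress = None; // clear the shard cursor
    };

    // The fix: refresh the cached state from the helper's return value
    // before the archive write.
    let ending_state = stream_and_apply(&mut writer, &[wrap_up]);

    assert_ne!(stale_snapshot, ending_state); // the old bug: archived != committed
    assert_eq!(ending_state, writer);         // archived row now matches the store
    assert_eq!(ending_state.ewrap_progress, None);
    println!("archived state matches committed state");
}
```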
`shard_key_ranges` previously guarded `total_shards` and `shard_index` with `debug_assert!`, which compiles to nothing in release. A `0` would divide by zero in `variant_range`, and a non-divisor of 256 would silently produce broken partitions. Since `total_shards` can come from persisted `ShardProgress.total` (not just the compile-time `ACCOUNT_SHARDS` constant), the invariants must hold at runtime in all profiles. Promotes the validation to unconditional `assert!` / `panic!` with informative messages. Addresses PR #978 review comment 3153861503. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
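The promoted checks might look roughly like this. The partition arithmetic below is an assumption based on the first-byte-prefix description; the real `shard_key_ranges` signature and return type may differ:

```rust
// Sketch: unconditional asserts on shard parameters. These must hold in
// release builds too, because total_shards can come from a persisted
// ShardProgress.total, not just the compile-time constant.

fn shard_key_range(shard_index: u32, total_shards: u32) -> (u8, u8) {
    // assert!, not debug_assert!: a corrupt persisted value must fail
    // loudly instead of dividing by zero or silently mis-partitioning.
    assert!(total_shards > 0, "total_shards must be non-zero");
    assert!(
        256 % total_shards == 0,
        "total_shards ({total_shards}) must divide 256 evenly"
    );
    assert!(
        shard_index < total_shards,
        "shard_index {shard_index} out of range for {total_shards} shards"
    );

    // Each shard owns a contiguous run of first-byte prefixes.
    let width = 256 / total_shards;
    let first = (shard_index * width) as u8;
    let last = (shard_index * width + width - 1) as u8;
    (first, last)
}

fn main() {
    assert_eq!(shard_key_range(0, 16), (0x00, 0x0f));
    assert_eq!(shard_key_range(15, 16), (0xf0, 0xff));
    assert_eq!(shard_key_range(0, 1), (0x00, 0xff));
    println!("shards partition the 0x00..=0xff prefix space");
}
```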
- genesis/mod.rs: fix copy/paste typo "Ewrap (which now runs before Ewrap)" → "per-shard Ewrap pass (which runs before the global Ewrap finalize)" (PR #978 comment 3154574432).
- work_units.md: add `text` language to fenced sequence diagram block for markdownlint MD040 (3153861508); fix stale loader name `BoundaryWork::load_ewrap` → `load_finalize` (3153861515).
- tests/memory.rs: sample heap allocation across full iteration as well as iterator construction, so a backend that buffers the shard on first `next()` no longer slips through (3153861530).
- model/epochs.rs: hoist `#[allow(deprecated)]` from per-item attributes on the prop_compose!/test items (where the macro hides the lint site) to the whole `prop_tests` module — silences the test-build deprecation warnings introduced by the V1/V2 split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- shard.rs: const-context divisibility check now uses `u32::is_multiple_of` (stable-const since Rust 1.87) per `clippy::manual_is_multiple_of`.
- model/epochs.rs: replace `let mut x = T::default(); x.f = ...; x` with a struct-literal `..Default::default()` form per `clippy::field_reassign_with_default`; also demote a doc comment attached to a `prop_compose!` invocation to a regular `//` comment (rustdoc can't attach docs to macro invocations).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
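The `field_reassign_with_default` rewrite looks like this in miniature (the `EndStats` type here is a hypothetical stand-in for the real model struct):

```rust
// Miniature of the clippy::field_reassign_with_default rewrite.

#[derive(Default)]
struct EndStats {
    rewards: u64,
    drops: u64,
}

fn main() {
    // Before (triggers the lint): build a default, then reassign a field.
    let mut before = EndStats::default();
    before.rewards = 10;

    // After: one struct literal, remaining fields filled from Default.
    let after = EndStats { rewards: 10, ..Default::default() };

    assert_eq!(before.rewards, after.rewards);
    assert_eq!(after.drops, 0);
    println!("equivalent values, lint-free form");
}
```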
The `crates/cardano/src/ashard/` module was merged into `crates/cardano/src/ewrap/` (it became EWRAP's per-shard leg) and the `EpochEndAccumulate` delta was renamed to `EWrapProgress`. The debug guide's tables, narrative, file-path references, and instrumentation hints still pointed at the old names and paths. Updates the Step 3 classification table, Step 5 work-unit narrative + source-file map, and Step 6 instrumentation table to:

- talk about EWRAP/ESTART each having a per-shard leg + finalize, not a separate ASHARD work unit;
- point at `ewrap/rewards.rs`, `crates/cardano/src/shard.rs`, and the per-unit `work_unit.rs` files;
- reference `EWrapProgress` (not `EpochEndAccumulate`) and the `EpochWrapUpV2` / `EpochTransitionV2` deltas where boundary close / open is described.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Compute pool live pledge globally in load_globals so each shard sees the full pledge sum, instead of summing only the owner accounts that fall in the current shard's key range. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Shards the per-account legs of the epoch-boundary pipeline (Ewrap, Estart, Rupd) so they no longer materialise the full account / pending-rewards namespace in memory at once, and adds first-class crash-recovery semantics so a mid-boundary restart resumes at the right shard instead of replaying.
Final architecture
Five work units (`Genesis`, `Roll`, `Rupd`, `Ewrap`, `Estart`) plus the `ForcedStop` sentinel. Three of them are sharded:

| Unit | Per-shard work | `finalize()` (global) |
|---|---|---|
| `Rupd` | accumulates `EpochState.incentives` via `RupdProgress` | clears `rupd_progress` |
| `Ewrap` | accumulates `EpochState.end` via `EWrapProgress` | emits `EpochWrapUpV2` to close the boundary |
| `Estart` | applies `AccountTransition` per account; emits `EStartProgress` | emits `EpochTransitionV2` |

Boundary order: `… Roll → Rupd → Roll … → Ewrap shards → Ewrap finalize → Estart shards → Estart finalize → next epoch's Roll …`.

The lifecycle now looks like
`initialize → start_shard..total_shards { load → compute → commit_wal → commit_state → commit_archive → commit_indexes } → finalize` (`crates/core/src/sync.rs::run_lifecycle`). Each sharded unit caches its `total_shards` and `start_shard` in `initialize` from persisted `*_progress` fields on `EpochState`, so a config change between shards or a crash mid-pipeline doesn't break the in-flight boundary.

The `crate::shard` module (`crates/cardano/src/shard.rs`) owns the credential-key-prefix partitioning, validated at compile time on the `ACCOUNT_SHARDS` constant and at runtime against any `total_shards` value pulled from persisted `ShardProgress.total`.

Major commit clusters
- `948d472e`…`eaf8e08c`, `fd888b8e`, `c7fce119` — split EWRAP and ESTART into per-shard + finalize halves, then add the same shape to RUPD. The `WorkUnit` trait grows `total_shards()`, `initialize()`, `finalize()`.
- `9627c3ae`, `ae260792`, `2294bb32`, `2ff83b52`, `9a80080d`, `1a3722db`, `8174d2a8`, `f84792ba`, `aa835bb6`, `4728e21f`, `1b849366` — renames + module split + persisted shard count.
- `07c3191c`, `62031ea8`, `a36ccf67` — collapses the standalone `AccountShard`/`AShard` module into EWRAP's per-shard leg; renames `EpochEndAccumulate` → `EWrapProgress`; replaces the runtime `account_shards` config field with the compile-time-validated `ACCOUNT_SHARDS` constant.
- `bb3a195a`, `5122e075`, `317b6766`, plus the four most recent fixes below.

Recent correctness + compatibility work
- WAL compatibility (`39237536`) — bincode encodes `CardanoDelta` variants by index and structs by position. Adding `EWrapProgress` / `EStartProgress` / `RupdProgress` mid-enum and appending `prev_*_progress` fields onto `EpochWrapUp` / `EpochTransition` would have broken every pre-PR WAL row. Variants are now appended at the end (indices 0..=38 frozen), and the original structs are kept verbatim under `#[deprecated]` while `EpochWrapUpV2` / `EpochTransitionV2` carry the new undo state. Pre-upgrade WAL still decodes; new commit paths emit V2.
- Restart cursor (`3e50db60`) — `initialize()` previously read only `*_progress.total`, and the lifecycle loop iterated `0..total_shards` unconditionally. After a crash mid-boundary, restart replayed shards `0..k-1` — unsafe because `AccountTransition` is non-idempotent. Adds `WorkUnit::start_shard()` (default `0`); the three sharded units cache `*_progress.committed` in `initialize` and the lifecycle loop now runs `start_shard..total_shards`. Also propagates `load_epoch` failures via `?` instead of silently falling back to defaults.
- `ending_state` refresh (`28fe4c39`) — `commit_finalize` was archiving the pre-`EpochWrapUp` snapshot because `stream_and_apply_namespace::<EpochState>` only mutated the writer, never the in-memory `BoundaryWork.ending_state`. A new EpochState-specific helper returns the post-apply singleton; `commit_finalize` swaps it into `ending_state` before the archive write.
- Release-mode shard invariants (`56b6dbb9`) — `shard_key_ranges` previously used `debug_assert!` for `total_shards > 0` and for `total_shards` dividing 256 evenly, so release builds could divide by zero or silently mis-partition on a corrupt persisted `ShardProgress.total`. Promoted to unconditional `assert!` / `panic!`.
- Docs, lints, and test hardening (`bd082462`, `0719956d`, `3492e5fb`) — typo in genesis seeding comment, markdownlint MD040 on a fenced code block, stale `BoundaryWork::load_ewrap` reference, memory test now samples allocation across iteration (not just construction), clippy clean, and a sweep of the `skills/debug-epoch-mismatch` guide that still referenced the removed `ASHARD` work unit and `ashard/` module paths.

Test plan
- `cargo check --workspace` clean
- `cargo clippy --workspace --all-targets` clean (only the unrelated `chumsky` future-incompat warning remains, from a transitive dep)
- `cargo test -p dolos-cardano --lib` — 110 unit tests pass, including new `restart_after_{estart,ewrap,rupd}_replays_*`, `restart_safety_no_skipped_work`, and `epoch_boundary_emits_ewrap_then_estart`, plus parallel `epoch_wrap_up_v2_roundtrip` / `epoch_transition_v2_roundtrip` proptests covering the V1/V2 split
- `cargo test --workspace --exclude dolos` — full workspace test sweep green
- `cargo test --test bootstrap` — crash-after-state-commit recovery (the helper now correctly skips `finalize()` to model the actual crash boundary)
- `cargo test --test memory` — shard-range iteration stays heap-bounded across both construction and full iteration
- `cargo test --test epoch_pots --release` — gold-standard end-to-end against DBSync ground truth (requires populated Mithril test instance)
- `EpochState` archive as a pre-PR run

🤖 Generated with Claude Code